Hi Akira, Karl

Thanks for your help in sorting out how  \pdfglyphtounicode  allows access to upper-plane code-points.

However, I think there is still a problem in how  pdftex  constructs the  /ToUnicode CMap.
This concerns characters with glyph names that use the ‘.’ qualifying construction;
e.g.
   a.sc,  b.sc , …  aacute.sc , W.alt , Theta.var1 ,  etc.

It seems that there can be entries for these glyph names within the database,
but those entries are never recovered to be written into the CMap.

This is because of the following code, in  tounicode.c  from line 187 onwards:

>> /* this function set proper values to *gp based on s; in case it returns
>>  * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
>>  * gp->unicode_seq too */
>> static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
>> {
>>  ...
>>
>>     /* strip everything after the first dot */
>>     p = strchr(s, '.');
>>     if (p != NULL) {
>>         *buf = 0;
>>         strncat(buf, s, p - s);
>>         s = buf;
>>     }
>>
>>  ...


The origin of this coding is surely Adobe’s stated way to establish a default
for which character to select for Copy/Paste, Searching, etc.
     *** when there is no guidance from a CMap or  /ActualText entry. ***

However, pdftex  is making it impossible to set such CMap entries for glyphs
with qualified names involving the ‘.’ character.

In short,  \pdfglyphtounicode  allows replacement Unicode strings to be entered
into the glyph-name database under qualified names, but  set_glyph_unicode  never
looks those entries up: it strips the qualifier first, and so uses whatever is
stored for the unqualified glyph name instead.


The attached file explores this using the  libertine-type1.sty  package.
(Make sure libertine.map is enabled, to use this example.)


My suggestion for altering  tounicode.c , within the  set_glyph_unicode  function block,
is to test the full name (including ‘.’s) first, for a database entry.
If found, use it.  Otherwise, try again using just the prefix (as at present).

If a name is multiply qualified, e.g.
          delta.sc.ipa         (occurs in  cmu-tipx.enc ;  also  omega.sc.ipa   q.sc.ipa   f.sc.ipa )
then drop the qualifiers one at a time, starting from the end.
So test in order:   delta.sc.ipa   delta.sc   delta
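
Here is a minimal sketch of that lookup order, to make the suggestion concrete.
It is not pdftex code: the small table  db  and the functions  db_lookup  and
lookup_with_fallback  are hypothetical stand-ins for the real glyph-name
database, and the table entries are only sample data.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the glyph-name database that
 * \pdfglyphtounicode fills; pdftex stores its entries differently. */
struct entry { const char *name; long code; };

static const struct entry db[] = {
    { "a",     0x0061 },
    { "a.sc",  0x1D00 },   /* LATIN LETTER SMALL CAPITAL A */
    { "delta", 0x03B4 },
    { "W",     0x0057 },
};

/* Return the code point stored under exactly this name, or -1 if absent. */
static long db_lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof db / sizeof db[0]; i++)
        if (strcmp(db[i].name, name) == 0)
            return db[i].code;
    return -1;
}

/* Try the full name first; on failure drop one ".qualifier" from the
 * end and try again, until a match is found or no qualifier is left. */
static long lookup_with_fallback(const char *name)
{
    char buf[256];
    char *p;
    long code;

    if (strlen(name) >= sizeof buf)
        return -1;                  /* name too long for this sketch */
    strcpy(buf, name);
    for (;;) {
        code = db_lookup(buf);
        if (code >= 0)
            return code;
        p = strrchr(buf, '.');      /* last dot, not the first */
        if (p == NULL || p == buf)
            return -1;              /* nothing (or only ".xxx") left */
        *p = '\0';
    }
}
```

With this sample table,  a.sc  finds its own entry (0x1D00) rather than that
of  a ,  while  delta.sc.ipa  and  W.alt  have no entry of their own and fall
back to  delta  and  W ,  as happens at present.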


Without a fix of this sort, the true small-cap characters that are in Unicode
can never be properly addressed, for archival/accessibility considerations,
as well as Copy/Paste.
Such characters occur within blocks:
   U+0250 – U+02AF  IPA Extensions
   U+1D00 – U+1D7F  Phonetic Extensions
   U+A720 – U+A7FF  Latin Extended-D
   U+FE50 – U+FE6F  Small Form Variants

And of course there are superiors and inferiors in other blocks, which are also
affected when glyph names such as:
   i.superior   n.superior
   zero.inferior   one.inferior   etc.
are used, as is very common in fonts.



Cheers

        Ross


Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.moore@mq.edu.au

http://www.maths.mq.edu.au

