Hi Akira, Karl

Thanks for your help in sorting out how \pdfglyphtounicode allows access to upper-plane code-points.

However, I think there is still a problem in how pdftex constructs the /ToUnicode CMap.
This concerns characters with glyph names that use the ‘.’ qualifying construction;
e.g.
   a.sc, b.sc, …, aacute.sc, W.alt, Theta.var1, etc.

It seems that there can be entries for these glyph names within the database,
but those entries are never recovered to be written into the CMap.

This is because of the following code, in tounicode.c from line 187 onwards:

>> /* this function set proper values to *gp based on s; in case it returns
>> * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
>> * gp->unicode_seq too */
>> static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
>> {
>> ...
>>
>>     /* strip everything after the first dot */
>>     p = strchr(s, '.');
>>     if (p != NULL) {
>>         *buf = 0;
>>         strncat(buf, s, p - s);
>>         s = buf;
>>     }
>>
>> ...

The origin of this coding is surely Adobe’s stated way to establish a default
for which character to select for Copy/Paste, Searching, etc.
     *** when there is no guidance from a CMap or /ActualText entry. ***

However, pdftex is making it impossible to set such CMap entries for glyphs
with qualified names involving the ‘.’ character.

In short, \pdfglyphtounicode allows replacement Unicode strings to be entered
into the glyph-name database under a qualified name, but set_glyph_unicode
never uses those entries: it strips the qualifier and looks up the
unqualified glyph name instead.

The attached file explores this using the libertine-type1.sty package.

(Make sure libertine.map is enabled, to use this example.)

My suggestion for altering tounicode.c, within the set_glyph_unicode function block,
is to test the full name (including any ‘.’s) first for a database entry.
If one is found, use it. Otherwise, try again using just the prefix (as at present).

Or, in case a name is multiply qualified, e.g.
          delta.sc.ipa   (occurs in cmu-tipx.enc; also omega.sc.ipa, q.sc.ipa, f.sc.ipa)
then drop the qualifications one at a time from the end.
So test in order:   delta.sc.ipa   delta.sc   delta
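
As a rough sketch of that lookup order (this is not the actual pdftex code:
demo_db, demo_lookup and demo_lookup_qualified are made-up names, the table
is a toy stand-in for the real glyph-name database, and the entries are only
illustrative of what \pdfglyphtounicode might have added):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the glyph-name database: a small fixed table
 * mapping a glyph name to one code point. */
typedef struct {
    const char *name;
    long code;
} demo_entry;

static const demo_entry demo_db[] = {
    { "a",             0x0061 },
    { "a.sc",          0x1D00 },  /* LATIN LETTER SMALL CAPITAL A */
    { "delta",         0x03B4 },  /* GREEK SMALL LETTER DELTA */
    { "zero.inferior", 0x2080 },  /* SUBSCRIPT ZERO */
};

/* Exact-match lookup; returns -1 when the name has no entry. */
static long demo_lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof demo_db / sizeof demo_db[0]; i++)
        if (strcmp(demo_db[i].name, name) == 0)
            return demo_db[i].code;
    return -1;
}

/* Test the full name first; on failure drop the last ".qualifier" and
 * retry, until an entry is found or no qualifier remains. */
static long demo_lookup_qualified(const char *name)
{
    char buf[256];
    char *dot;
    long code;

    assert(name != NULL);
    if (strlen(name) >= sizeof buf)
        return -1;                  /* too long for this sketch */
    strcpy(buf, name);

    for (;;) {
        code = demo_lookup(buf);
        if (code >= 0)
            return code;
        dot = strrchr(buf, '.');    /* last dot, not the first */
        if (dot == NULL || dot == buf)
            return -1;              /* nothing left to strip (or ".notdef") */
        *dot = '\0';
    }
}
```

With this table, demo_lookup_qualified("a.sc") finds the explicit a.sc entry,
whereas "a.alt" and "delta.sc.ipa" fall back to the entries for a and delta.
The real set_glyph_unicode also does other work (such as names that
themselves encode code points), which this sketch leaves out.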

Without a fix of this sort, the true small-cap characters that are in Unicode
can never be properly addressed, for archival/accessibility considerations,
as well as Copy/Paste.
Such characters occur within blocks:
   U+0250 – U+02AF IPA Extensions
   U+1D00 – U+1D7F Phonetic Extensions
   U+A720 – U+A7FF Latin Extended-D
   U+FE50 – U+FE6F Small Form Variants

And of course there are superiors and inferiors in other blocks, which are
also affected when glyph names are used such as:
   i.superior   n.superior
   /zero.inferior /one.inferior   etc.
as is very common in fonts.

Cheers

        Ross

Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M:+61 407 288 255 | E: ross.moore@mq.edu.au

http://www.maths.mq.edu.au

<http://mq.edu.au/>
