Hi Akira, Karl

Thanks for your help in sorting out how \pdfglyphtounicode allows access to upper-plane code-points.

However, I think there is still a problem in how pdftex constructs the /ToUnicode CMap. This concerns characters with glyph names that use the ‘.’ qualifying construction; e.g. a.sc, b.sc, … aacute.sc, W.alt, Theta.var1, etc.

It seems that there can be entries for these glyph names within the database, but those entries are never recovered to be written into the CMap. This is because of the following coding:

tounicode.c lines 187 onwards:

    /* this function set proper values to *gp based on s; in case it returns
     * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
     * gp->unicode_seq too */
    static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
    {
    ...
        /* strip everything after the first dot */
        p = strchr(s, '.');
        if (p != NULL) {
            *buf = 0;
            strncat(buf, s, p - s);
            s = buf;
        }
...

The origin of this coding is surely Adobe’s stated way to establish a default for which character to select for Copy/Paste, Searching, etc.
*** when there is no guidance from a CMap or /ActualText entry. ***

However, pdftex is making it impossible to set such CMap entries for glyphs with qualified names involving the ‘.’ character. In short, \pdfglyphtounicode allows replacement Unicode strings to be entered into the glyph-name database, but set_glyph_unicode never uses those entries; it looks up the unqualified glyph name instead.

The attached file explores this using the libertine-type1.sty package. (Make sure libertine.map is enabled, to use this example.)

My suggestion for altering tounicode.c, within the set_glyph_unicode function block, is to test the full name (including ‘.’s) first, for a database entry. If found, use it. Otherwise, try again using just the prefix (as at present).

If a name is multiply qualified, e.g. delta.sc.ipa (which occurs in cmu-tipx.enc; also omega.sc.ipa, q.sc.ipa, f.sc.ipa), then drop the qualifications off one at a time from the end. So test in order:

    delta.sc.ipa
    delta.sc
    delta

Without a fix of this sort, the true small-cap characters that are in Unicode can never be properly addressed, for archival/accessibility considerations, as well as Copy/Paste. Such characters occur within blocks:

    U+0250 - U+02AF   IPA Extensions
    U+1D00 - U+1D7F   Phonetic Extensions
    U+A720 - U+A7FF   Latin Extended-D
    U+FE50 - U+FE6F   Small Form Variants

And of course there are superiors and inferiors in other blocks, which are also affected when glyph names such as

    i.superior  n.superior  /zero.inferior  /one.inferior  etc.

are used, as is very common in fonts.
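
To make the suggested lookup order concrete, here is a minimal sketch in C. It is not a patch against tounicode.c: the table demo_db, the type demo_entry and the functions demo_lookup and demo_lookup_qualified are all hypothetical stand-ins for pdftex's real glyph-name database and its lookup, and the table entries are only examples of what a user might register with \pdfglyphtounicode. The sketch covers only the order in which names are tried, and nothing else that set_glyph_unicode does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the glyph-name database (not pdftex's type). */
typedef struct {
    const char *name;   /* glyph name, possibly with '.' qualifiers */
    long code;          /* Unicode code point */
} demo_entry;

/* Example entries only; a.sc is mapped to U+1D00, the small capital A. */
static const demo_entry demo_db[] = {
    { "a",     0x0061 },
    { "delta", 0x03B4 },
    { "a.sc",  0x1D00 }
};

/* Exact-match lookup; returns NULL when the name is not in the table. */
static const demo_entry *demo_lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof demo_db / sizeof demo_db[0]; i++)
        if (strcmp(demo_db[i].name, name) == 0)
            return &demo_db[i];
    return NULL;
}

/* Try the full name first, then drop one ".qualifier" at a time from the
 * end: delta.sc.ipa, then delta.sc, then delta. */
static const demo_entry *demo_lookup_qualified(const char *name)
{
    char buf[256];
    char *dot;
    const demo_entry *e;

    assert(name != NULL);
    if (strlen(name) >= sizeof buf)
        return NULL;            /* too long for this sketch's buffer */
    strcpy(buf, name);

    for (;;) {
        e = demo_lookup(buf);
        if (e != NULL)
            return e;
        dot = strrchr(buf, '.');
        if (dot == NULL || dot == buf)
            return NULL;        /* nothing left to strip (e.g. ".notdef") */
        *dot = '\0';            /* remove the last qualifier and retry */
    }
}
```

With this table, demo_lookup_qualified("a.sc") finds the a.sc entry directly, whereas stripping at the first dot before any lookup can only ever reach the entry for a. Names without an entry of their own, such as a.alt or delta.sc.ipa, still fall back to a and delta, so the present default behaviour is kept.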

Cheers

Ross

Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M: +61 407 288 255 | E: ross.moore@mq.edu.au
http://www.maths.mq.edu.au
http://mq.edu.au/