Can you please submit your patch to
http://sarovar.org/tracker/?atid=495&group_id=106&func=browse

Thanks,
Thanh

On Thu, Aug 28, 2008 at 12:59:51AM +0300, Vasile Gaburici wrote:
There are a couple of LaTeX packages out there that provide CMaps. They don't work as well as \pdfglyphtounicode: virtual fonts don't get CMaps at all (the CMap is included in the PDF but not referenced), and otftotfm-installed fonts lack the CMap entries for the ligatures that otftotfm sneaks into empty slots. As you know, \pdfglyphtounicode fixes these problems.
On the other hand, these two packages let the user specify a CMap for each LaTeX encoding, so the user can give different Unicode values to the same PS glyph name in different LaTeX encodings. Of course that works properly only if the fonts invoked by the different LaTeX encodings are different; otherwise only one can win the \pdffontattr. A compelling application of this feature is CMaps that set math code points (usually above the BMP) for TeX math fonts; those glyphs have exactly the same names as in text fonts (/A etc.). Adding namespaces to \pdfglyphtounicode makes those two packages obsolete in their current implementation.
Another advantage of namespaces is the ability to (reliably) fix TrueType font CMaps. The troublesome glyphs are usually ligatures that don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm writes as /indexZZZ in the enc file. Putting those in a per-font namespace avoids any potential clashes.
So, I've patched pdftex to provide namespaces using the following syntax extension: the first argument of \pdfglyphtounicode can now take additional forms:

  \pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
  \pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}
Since fonts for which the built-in encoding is used happen to be exactly those that have multiple design sizes (cmr, stmary etc.), using a separate ps-enc-name for each is not helpful. Instead, the 'enc' namespace for those is obtained by dropping any final digits from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes only).
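The digit-dropping rule can be illustrated with a small standalone sketch. The helper name enc_namespace_from_font is hypothetical and not part of the patch; like the patch's loop, it always keeps the first character, so an all-digit name never becomes empty.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the patch): derive the 'enc' namespace of a
 * font with built-in encoding by dropping trailing ASCII digits from its
 * TeX font name, e.g. "cmr10" -> "cmr".
 * Returns 0 on success, -1 if the output buffer is too small. */
static int enc_namespace_from_font(const char *font_name,
                                   char *out, size_t out_size)
{
    size_t len = strlen(font_name);

    /* Always keep the first character, even if it is a digit. */
    while (len > 1 && font_name[len - 1] >= '0' && font_name[len - 1] <= '9')
        --len;

    if (len + 1 > out_size)
        return -1;              /* no room for the name plus terminator */

    memcpy(out, font_name, len);
    out[len] = '\0';
    return 0;
}
```

With this rule cmr10 and cmr12 share the namespace cmr, and cmmi10 maps to cmmi, so one enc:cmmi/... entry covers every design size.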
The search policy is to first search the font namespace, then the encoding namespace, and finally the global namespace, for which the syntax remains unchanged. All these namespaces are implemented in the same avl tree, simply using the above strings as key names. In theory this makes the search 3 times slower, but that particular phase of pdftex hardly takes any time, so it seemed premature to implement any optimization.
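A minimal sketch of that three-step lookup follows. It is not the patch itself: a linear scan over an array stands in for the avl tree, and the names g2u_entry, find_key and lookup_glyph are made up for illustration. Unlike the patch, this sketch skips a namespace when its key would not fit in the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry; in pdftex the entries live in an avl tree. */
struct g2u_entry {
    const char *key;          /* "A", "enc:cmmi/A", "fnt:cmmi10/A", ... */
    const char *unicode_hex;  /* value given to \pdfglyphtounicode */
};

/* Linear scan standing in for avl_find(); returns NULL when not found. */
static const char *find_key(const struct g2u_entry *tab, size_t n,
                            const char *key)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (strcmp(tab[i].key, key) == 0)
            return tab[i].unicode_hex;
    return NULL;
}

/* Search the font namespace, then the encoding namespace, then the
 * global (unprefixed) namespace, in that order. */
static const char *lookup_glyph(const struct g2u_entry *tab, size_t n,
                                const char *font, const char *enc,
                                const char *glyph)
{
    char key[256];
    const char *hit;
    int len;

    len = snprintf(key, sizeof key, "fnt:%s/%s", font, glyph);
    if (len >= 0 && (size_t) len < sizeof key
        && (hit = find_key(tab, n, key)) != NULL)
        return hit;

    len = snprintf(key, sizeof key, "enc:%s/%s", enc, glyph);
    if (len >= 0 && (size_t) len < sizeof key
        && (hit = find_key(tab, n, key)) != NULL)
        return hit;

    return find_key(tab, n, glyph);
}
```

Because the prefixes are part of the key, all three namespaces can share one container, which is the design choice described above; the cost is up to three lookups per glyph.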
Some usage examples:
  % make the ti ligature searchable in Calibri regular
  \pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
  % go crazy with Unicode math; TeX math italic gives above-BMP math A
  \pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required
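The value D835 DC34 is the UTF-16BE surrogate pair for U+1D434 (MATHEMATICAL ITALIC CAPITAL A). As a rough sketch of how such values can be computed for other above-BMP code points, the hypothetical helper below (utf16be_hex, not a pdftex function) formats a code point the way the second argument expects it.

```c
#include <stdio.h>

/* Hypothetical helper: write the UTF-16BE code units of code point cp as
 * uppercase hex into out ("XXXX" for the BMP, "XXXX XXXX" above it).
 * Returns 0 on success, -1 for an invalid code point or a short buffer. */
static int utf16be_hex(unsigned long cp, char *out, size_t out_size)
{
    int n;

    /* Reject values beyond Unicode and lone surrogate code points. */
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return -1;

    if (cp < 0x10000UL) {
        n = snprintf(out, out_size, "%04lX", cp);
    } else {
        unsigned long v = cp - 0x10000UL;         /* 20-bit offset */
        unsigned long hi = 0xD800UL + (v >> 10);  /* high surrogate */
        unsigned long lo = 0xDC00UL + (v & 0x3FFUL); /* low surrogate */
        n = snprintf(out, out_size, "%04lX %04lX", hi, lo);
    }
    return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

For cp = 0x1D434 this yields "D835 DC34", matching the example above; for a BMP character such as t (0x74) it yields "0074".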
Note that search behavior for math letters varies with PDF viewers. Acrobat implements only canonical equivalence, so you need to enter the exact code point, and copy/paste preserves the code points, so you can paste into a LaTeX document if it's using utf8x input encoding. Evince implements compatibility equivalence, so it's easier to find those math As by searching for plain A, but they also copy/paste as plain A. You can, however, use pdftotext, which uses the same poppler backend, to have the code points preserved. I'm not really advocating Unicode math letters, but now they're easily supported in pdftex -- no need for manual CMaps anymore.
BTW, \pdfglyphtounicode now really needs to be documented in the manual, so people would stop writing (buggy) CMaps by hand. I volunteer to do it if you accept the patch :)
I also wrote some CMap handling tools, mostly for verification; I'll send a separate announcement about that.
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons	2008-08-26 17:44:23.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h	2008-08-26 17:45:35.000000000 +0300
@@ -199,7 +199,7 @@ extern boolean handle_subfont_fm(fm_entr
 /* tounicode.c */
 extern void glyph_unicode_free(void);
 extern void deftounicode(strnumber, strnumber);
-extern integer write_tounicode(char **, char *);
+extern integer write_tounicode(char **, const char *, const char *);
 
 /* utils.c */
 extern boolean str_eq_cstr(strnumber, char *);
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons	2008-08-26 11:37:18.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c	2008-08-27 21:41:54.000000000 +0300
@@ -127,7 +127,7 @@ void deftounicode(strnumber glyph, strnu
 }
 
 
-static long check_unicode_value(char *s, boolean multiple_value)
+static long check_unicode_value(const char *s, boolean multiple_value)
 {
     int l = strlen(s);
     int i;
@@ -184,12 +184,15 @@ static char *utf16be_str(long code)
 }
 
 
-/* this function set proper values to *gp based on s; in case it returns
+/* this function writes /ToUnicode data to *gp based on glyph name s and
+ * taking into account fntname and encname; in case it returns
  * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
  * gp->unicode_seq too */
-static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
+static void set_glyph_unicode(const char *s, const char* fntname,
+                              const char* encname, glyph_unicode_entry *gp)
 {
     char buf[SMALL_BUF_SIZE], buf2[SMALL_BUF_SIZE], *p;
+    const char *p2; /* p2 points in s; p above points in writable copies */
     long code;
     boolean last_component;
     glyph_unicode_entry tmp, *ptmp;
@@ -223,7 +226,7 @@ static void set_glyph_unicode(char *s, g
         for (;;) {
             *p = 0;
             tmp.code = UNI_UNDEF;
-            set_glyph_unicode(s, &tmp);
+            set_glyph_unicode(s, fntname, encname, &tmp);
             switch (tmp.code) {
             case UNI_UNDEF: /* not found, do nothing */
                 break;
@@ -256,8 +259,32 @@ static void set_glyph_unicode(char *s, g
         return;
     }
 
-    /* lookup for glyph name in the database */
-    tmp.name = s;
+    /* Glyph name search strategy: first look up the glyph name in the
+       font's namespace, failing that look it up in the PS encoding
+       namespace, and finally look it up in the main database. */
+    /* Note: buf may alias s in the code below, but s and buf2 are
+       guaranteed to be distinct because the code changing buf2 above
+       always returns before reaching the code below. */
+    snprintf(buf2, SMALL_BUF_SIZE, "fnt:%s/%s", fntname, s);
+    tmp.name = buf2;
+    tmp.code = UNI_UNDEF;
+    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
+    if (ptmp != NULL) {
+        gp->code = ptmp->code;
+        gp->unicode_seq = ptmp->unicode_seq;
+        return;
+    }
+    snprintf(buf2, SMALL_BUF_SIZE, "enc:%s/%s", encname, s);
+    tmp.name = buf2;
+    tmp.code = UNI_UNDEF;
+    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
+    if (ptmp != NULL) {
+        gp->code = ptmp->code;
+        gp->unicode_seq = ptmp->unicode_seq;
+        return;
+    }
+    tmp.name = (char *)s; /* this is okay since we're not calling
+                             destroy_glyph_unicode_entry on this */
     tmp.code = UNI_UNDEF;
     ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
     if (ptmp != NULL) {
@@ -268,14 +295,14 @@ static void set_glyph_unicode(char *s, g
 
     /* check for case of "uniXXXX" (multiple 4-hex-digit values allowed) */
     if (str_prefix(s, "uni")) {
-        p = s + strlen("uni");
-        code = check_unicode_value(p, true);
+        p2 = s + strlen("uni");
+        code = check_unicode_value(p2, true);
         if (code != UNI_UNDEF) {
-            if (strlen(p) == 4) /* single value */
+            if (strlen(p2) == 4) /* single value */
                 gp->code = code;
             else { /* multiple value */
                 gp->code = UNI_EXTRA_STRING;
-                gp->unicode_seq = xstrdup(p);
+                gp->unicode_seq = xstrdup(p2);
             }
         }
         return; /* since the last case cannot happen */
@@ -283,8 +310,8 @@ static void set_glyph_unicode(char *s, g
 
     /* check for case of "uXXXX" (single value up to 6 hex digits) */
     if (str_prefix(s, "u")) {
-        p = s + strlen("u");
-        code = check_unicode_value(p, false);
+        p2 = s + strlen("u");
+        code = check_unicode_value(p2, false);
         if (code != UNI_UNDEF) {
             assert(code >= 0);
             gp->code = code;
@@ -292,7 +319,9 @@ static void set_glyph_unicode(char *s, g
     }
 }
 
-integer write_tounicode(char **glyph_names, char *name)
+/* tex font name is bare (no .tfm), but enc name is ending in .enc; */
+integer write_tounicode(char **glyph_names, const char *texname,
+                        const char* encname)
 {
     char buf[SMALL_BUF_SIZE], *p;
     static char builtin_suffix[] = "-builtin";
@@ -301,18 +330,24 @@ integer write_tounicode(char **glyph_nam
     integer objnum;
     int i, j;
     int bfchar_count, bfrange_count, subrange_count;
-    assert(strlen(name) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
+
     if (glyph_unicode_tree == NULL) {
         pdftex_warn("no GlyphToUnicode entry has been inserted yet!");
         fixedgentounicode = 0;
         return 0;
     }
-    strcpy(buf, name);
-    if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
-        *p = 0; /* strip ".enc" from encoding name */
-    else
-        strcat(buf, builtin_suffix); /* ".enc" not present, this is a builtin
-                                        encoding so the name is eg "cmr10-builtin" */
+    if (encname) {
+        assert(strlen(encname) < SMALL_BUF_SIZE);
+        strcpy(buf, encname);
+        if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
+            *p = 0; /* strip ".enc" from encoding name */
+        else /* some silly encoding file name not ending in enc; use as-is */
+            pdftex_warn("Dubious encoding file name: `%s'", encname);
+    } else { /* this is a builtin encoding, so name is e.g. "cmr10-builtin" */
+        assert(strlen(texname) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
+        strcpy(buf, texname);
+        strcat(buf, builtin_suffix);
+    }
     objnum = pdfnewobjnum();
     pdfbegindict(objnum, 0);
     pdfbeginstream();
@@ -336,10 +371,23 @@ integer write_tounicode(char **glyph_nam
                "1 begincodespacerange\n"
                "<00> <FF>\n" "endcodespacerange\n", buf, buf, buf, buf, buf);
 
+    /* Fonts with built-in encoding have a unique encoding name so
+       looking up that encoding does not buy us any grouping. Instead,
+       we group them by dropping any terminal digits from their name. */
+    if (!encname) {
+        strcpy(buf, texname);
+        for (p = buf + strlen(buf) - 1; p > buf; --p) {
+            if (*p >= '0' && *p <= '9')
+                *p = 0;
+            else
+                break;
+        }
+    }
+
     /* set gtab */
     for (i = 0; i < 256; ++i) {
         gtab[i].code = UNI_UNDEF;
-        set_glyph_unicode(glyph_names[i], &gtab[i]);
+        set_glyph_unicode(glyph_names[i], texname, buf, &gtab[i]);
     }
     gtab[256].code = UNI_UNDEF;
 
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons	2008-08-26 17:22:10.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c	2008-08-26 17:42:56.000000000 +0300
@@ -534,11 +534,11 @@ void write_fontdictionary(fo_entry * fo)
     if (fixedgentounicode > 0 && fo->fd != NULL) {
         if (fo->fe != NULL) {
             fo->tounicode_objnum =
-                write_tounicode(fo->fe->glyph_names, fo->fe->name);
+                write_tounicode(fo->fe->glyph_names, fo->fm->tfm_name, fo->fe->name);
         } else if (is_type1(fo->fm)) {
             assert(fo->fd->builtin_glyph_names != NULL);
             fo->tounicode_objnum =
-                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name);
+                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name, NULL);
         }
     }