Re: [NTG-pdftex] Patch to support CMap namespaces

28 Aug 2008

can you please submit your patch to
http://sarovar.org/tracker/?atid=495&group_id=106&func=browse

thanks,
Thanh

On Thu, Aug 28, 2008 at 12:59:51AM +0300, Vasile Gaburici wrote:
...
There are a couple of LaTeX packages out there that provide CMaps.
They don't work as well as \pdfglyphtounicode, i.e. virtual fonts
don't get CMaps at all (the CMap is included in the PDF but not
referenced), and otftotfm-installed fonts lack the CMap entries for
the ligatures that otftotfm sneaks into empty slots. As you know,
\pdfglyphtounicode fixes these problems.
On the other hand, these two packages let the user specify a CMap for
each LaTeX encoding, so the user can give different Unicode values to
the same PS glyph name in different LaTeX encodings. Of course that
works properly only if the fonts invoked by the different LaTeX
encodings are different; otherwise only one can win the \pdffontattr.
A compelling application of this feature is CMaps that set math code
points (usually above the BMP) for TeX math fonts; those glyphs have
exactly the same names as in text fonts (/A etc.). Adding namespaces to
\pdfglyphtounicode makes those two packages obsolete in their current
implementation.
Another advantage of namespaces is the ability to (reliably) fix
TrueType font CMaps. The troublesome glyphs are usually ligatures that
don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm
writes as /indexZZZ in the enc file. Putting those in a per-font
namespace avoids any potential clashes.
So, I've patched pdftex to provide namespaces using the following
syntax extension: the first argument of \pdfglyphtounicode can now
take additional forms:
\pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
\pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}
Since fonts for which the built-in encoding is used happen to be
exactly those that have multiple design sizes (cmr, stmary etc.),
using a separate ps-enc-name for each is not helpful. Instead, the
'enc' namespace for those is obtained by dropping any final digits
from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes
only).
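A minimal sketch of the digit-dropping rule described above, written as a standalone C function. The name enc_namespace_from_font and its buffer-passing interface are hypothetical and are not taken from the patch; like the loop in the patch, it never removes the first character of the name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): copy a TeX font name into
   out and drop any trailing digits, always keeping the first character,
   e.g. "cmr10" -> "cmr". */
void enc_namespace_from_font(const char *texname, char *out, size_t outsize)
{
    size_t len = strlen(texname);

    assert(len < outsize);          /* caller must supply a large enough buffer */
    memcpy(out, texname, len + 1);  /* copy including the terminating NUL */
    while (len > 1 && out[len - 1] >= '0' && out[len - 1] <= '9')
        out[--len] = '\0';
}
```

Under this rule all design sizes of a family (cmr5, cmr10, cmr17, ...) share one 'enc' namespace, so a single enc:cmr/... entry covers them all.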
The search policy is to first search the font namespace, then the
encoding namespace, and finally the global namespace, for which the
syntax remains unchanged. All these namespaces are implemented in the
same AVL tree, simply using the above strings as key names. In theory
this makes the search 3 times slower, but that particular phase of
pdftex hardly takes any time, so it seemed premature to implement any
optimization.
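The lookup order can be sketched as follows. This is an illustration only: struct glyph_entry, find_key and lookup_glyph are hypothetical names, and a linear search over an array stands in for the tree lookup that pdftex actually performs. Only the key formats ("fnt:font/glyph", "enc:encoding/glyph", bare glyph name) come from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the glyph-to-unicode database. */
struct glyph_entry {
    const char *key;      /* "fnt:font/glyph", "enc:encoding/glyph", or "glyph" */
    const char *unicode;  /* hex string as given to \pdfglyphtounicode */
};

/* Linear search standing in for the tree lookup; NULL when key is absent. */
const char *find_key(const struct glyph_entry *db, size_t n, const char *key)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(db[i].key, key) == 0)
            return db[i].unicode;
    return NULL;
}

/* Try the font namespace, then the encoding namespace, then the global one. */
const char *lookup_glyph(const struct glyph_entry *db, size_t n,
                         const char *fntname, const char *encname,
                         const char *glyph)
{
    char key[256];
    const char *hit;
    int len;

    len = snprintf(key, sizeof key, "fnt:%s/%s", fntname, glyph);
    assert(len >= 0);
    /* a truncated key cannot match anything, so skip the lookup in that case */
    if ((size_t) len < sizeof key && (hit = find_key(db, n, key)) != NULL)
        return hit;

    len = snprintf(key, sizeof key, "enc:%s/%s", encname, glyph);
    assert(len >= 0);
    if ((size_t) len < sizeof key && (hit = find_key(db, n, key)) != NULL)
        return hit;

    return find_key(db, n, glyph);
}
```

With entries for "A" and "enc:cmmi/A" in the database, a lookup of A for cmmi10 (encoding namespace cmmi) returns the enc: entry, while the same lookup for cmr10 falls through to the global entry.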
Some usage examples:
% make the ti ligature searchable in Calibri regular
\pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
% go crazy with Unicode math; TeX math italic gives above-BMP math A
\pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required
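For reference, the value D835 DC34 in the last example is the UTF-16BE surrogate pair for U+1D434 (MATHEMATICAL ITALIC CAPITAL A). A minimal sketch of that conversion, assuming a hypothetical helper named utf16be_hex that is not part of pdftex or of the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a Unicode code point as the UTF-16BE hex
   string used in the second argument of \pdfglyphtounicode. */
void utf16be_hex(unsigned long cp, char *out, size_t outsize)
{
    assert(cp <= 0x10FFFFUL);
    assert(cp < 0xD800UL || cp > 0xDFFFUL);  /* lone surrogates are not valid */
    assert(outsize >= 10);                   /* "XXXX XXXX" plus NUL */
    if (cp < 0x10000UL) {
        /* inside the BMP: one 16-bit unit */
        snprintf(out, outsize, "%04lX", cp);
    } else {
        /* above the BMP: split the 20-bit offset into a surrogate pair */
        unsigned long v = cp - 0x10000UL;
        unsigned long hi = 0xD800UL + (v >> 10);
        unsigned long lo = 0xDC00UL + (v & 0x3FFUL);
        snprintf(out, outsize, "%04lX %04lX", hi, lo);
    }
}
```

Calling utf16be_hex(0x1D434UL, buf, sizeof buf) with a 16-byte buf leaves "D835 DC34" in buf.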
Note that search behavior for math letters varies with PDF viewers.
Acrobat implements only canonical equivalence, so you need to enter
the exact code point, and copy/paste preserves the code points, so you
can paste into a LaTeX document if it's using utf8x input encoding.
Evince implements compatibility equivalence, so it's easier to find
those math As by searching for plain A, but they also copy/paste as
plain A. However, you can use pdftotext, which uses the same poppler
backend, to have the code points preserved. I'm not really
advocating Unicode math letters, but now they're easily supported in
pdftex -- no need for manual CMaps anymore.
BTW, \pdfglyphtounicode now really needs to be documented in the
manual, so people would stop writing (buggy) CMaps by hand. I
volunteer to do it if you accept the patch :)
I also wrote some CMap handling tools, mostly for verification; I'll
send a separate announcement about them.
...
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons	2008-08-26 17:44:23.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h	2008-08-26 17:45:35.000000000 +0300
@@ -199,7 +199,7 @@ extern boolean handle_subfont_fm(fm_entr
 /* tounicode.c */
 extern void glyph_unicode_free(void);
 extern void deftounicode(strnumber, strnumber);
-extern integer write_tounicode(char **, char *);
+extern integer write_tounicode(char **, const char *, const char *);
 /* utils.c */
 extern boolean str_eq_cstr(strnumber, char *);
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons	2008-08-26 11:37:18.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c	2008-08-27 21:41:54.000000000 +0300
@@ -127,7 +127,7 @@ void deftounicode(strnumber glyph, strnu
 }
-static long check_unicode_value(char *s, boolean multiple_value)
+static long check_unicode_value(const char *s, boolean multiple_value)
 {
     int l = strlen(s);
     int i;
@@ -184,12 +184,15 @@ static char *utf16be_str(long code)
 }
-/* this function set proper values to *gp based on s; in case it returns
+/* this function writes /ToUnicode data to *gp based on glyph name s and
+ * taking into account fntname and encname; in case it returns
  * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
  * gp->unicode_seq too */
-static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
+static void set_glyph_unicode(const char *s, const char* fntname,
+                              const char* encname, glyph_unicode_entry *gp)
 {
     char buf[SMALL_BUF_SIZE], buf2[SMALL_BUF_SIZE], *p;
+    const char *p2; /* p2 points in s; p above points in writable copies */
     long code;
     boolean last_component;
     glyph_unicode_entry tmp, *ptmp;
@@ -223,7 +226,7 @@ static void set_glyph_unicode(char *s, g
         for (;;) {
             *p = 0;
             tmp.code = UNI_UNDEF;
-            set_glyph_unicode(s, &tmp);
+            set_glyph_unicode(s, fntname, encname, &tmp);
             switch (tmp.code) {
             case UNI_UNDEF:    /* not found, do nothing */
                 break;
@@ -256,8 +259,32 @@ static void set_glyph_unicode(char *s, g
         return;
     }
-    /* lookup for glyph name in the database */
-    tmp.name = s;
+    /* Glyph name search strategy: first look up the glyph name in the
+       font's namespace, failing that look it up in the PS encoding
+       namespace, and finally look it up in the main database. */
+    /* Note: buf may alias s in the code below, but s and buf2 are
+       guaranteed to be distinct because the code changing buf2 above
+       always returns before reaching the code below. */
+    snprintf(buf2, SMALL_BUF_SIZE, "fnt:%s/%s", fntname, s);
+    tmp.name = buf2;
+    tmp.code = UNI_UNDEF;
+    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
+    if (ptmp != NULL) {
+        gp->code = ptmp->code;
+        gp->unicode_seq = ptmp->unicode_seq;
+        return;
+    }
+    snprintf(buf2, SMALL_BUF_SIZE, "enc:%s/%s", encname, s);
+    tmp.name = buf2;
+    tmp.code = UNI_UNDEF;
+    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
+    if (ptmp != NULL) {
+        gp->code = ptmp->code;
+        gp->unicode_seq = ptmp->unicode_seq;
+        return;
+    }
+    tmp.name = (char *)s; /* this is okay since we're not calling
+                             destroy_glyph_unicode_entry on this */
     tmp.code = UNI_UNDEF;
     ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
     if (ptmp != NULL) {
@@ -268,14 +295,14 @@ static void set_glyph_unicode(char *s, g
     /* check for case of "uniXXXX" (multiple 4-hex-digit values allowed) */
     if (str_prefix(s, "uni")) {
-        p = s + strlen("uni");
-        code = check_unicode_value(p, true);
+        p2 = s + strlen("uni");
+        code = check_unicode_value(p2, true);
         if (code != UNI_UNDEF) {
-            if (strlen(p) == 4) /* single value */
+            if (strlen(p2) == 4) /* single value */
                 gp->code = code;
             else {              /* multiple value */
                 gp->code = UNI_EXTRA_STRING;
-                gp->unicode_seq = xstrdup(p);
+                gp->unicode_seq = xstrdup(p2);
             }
         }
         return;                 /* since the last case cannot happen */
@@ -283,8 +310,8 @@ static void set_glyph_unicode(char *s, g
     /* check for case of "uXXXX" (single value up to 6 hex digits) */
     if (str_prefix(s, "u")) {
-        p = s + strlen("u");
-        code = check_unicode_value(p, false);
+        p2 = s + strlen("u");
+        code = check_unicode_value(p2, false);
         if (code != UNI_UNDEF) {
             assert(code >= 0);
             gp->code = code;
@@ -292,7 +319,9 @@ static void set_glyph_unicode(char *s, g
     }
 }
-integer write_tounicode(char **glyph_names, char *name)
+/* tex font name is bare (no .tfm), but enc name is ending in .enc; */
+integer write_tounicode(char **glyph_names, const char *texname,
+                        const char* encname)
 {
     char buf[SMALL_BUF_SIZE], *p;
     static char builtin_suffix[] = "-builtin";
@@ -301,18 +330,24 @@ integer write_tounicode(char **glyph_nam
     integer objnum;
     int i, j;
     int bfchar_count, bfrange_count, subrange_count;
-    assert(strlen(name) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
+
     if (glyph_unicode_tree == NULL) {
         pdftex_warn("no GlyphToUnicode entry has been inserted yet!");
         fixedgentounicode = 0;
         return 0;
     }
-    strcpy(buf, name);
-    if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
-        *p = 0;                 /* strip ".enc" from encoding name */
-    else
-        strcat(buf, builtin_suffix);    /* ".enc" not present, this is a builtin
-                                           encoding so the name is eg "cmr10-builtin" */
+    if (encname) {
+        assert(strlen(encname) < SMALL_BUF_SIZE);
+        strcpy(buf, encname);
+        if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
+            *p = 0;                 /* strip ".enc" from encoding name */
+        else /* some silly encoding file name not ending in enc; use as-is */
+            pdftex_warn("Dubious encoding file name: `%s'", encname);
+    } else { /* this is a builtin encoding, so name is e.g. "cmr10-builtin" */
+        assert(strlen(texname) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
+        strcpy(buf, texname);
+        strcat(buf, builtin_suffix);
+    }
     objnum = pdfnewobjnum();
     pdfbegindict(objnum, 0);
     pdfbeginstream();
@@ -336,10 +371,23 @@ integer write_tounicode(char **glyph_nam
                "1 begincodespacerange\n"
                "<00> <FF>\n" "endcodespacerange\n", buf, buf, buf, buf, buf);
+    /* Fonts with built-in encoding have a unique encoding name so
+       looking up that encoding does not buy us any grouping. Instead,
+       we group them by dropping any terminal digits from their name. */
+    if (!encname) {
+        strcpy(buf, texname);
+        for (p = buf + strlen(buf) - 1; p > buf; --p) {
+            if (*p >= '0' && *p <= '9')
+                *p = 0;
+            else
+                break;
+        }
+    }
+
     /* set gtab */
     for (i = 0; i < 256; ++i) {
         gtab[i].code = UNI_UNDEF;
-        set_glyph_unicode(glyph_names[i], &gtab[i]);
+        set_glyph_unicode(glyph_names[i], texname, buf, &gtab[i]);
     }
     gtab[256].code = UNI_UNDEF;
diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c
--- pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons	2008-08-26 17:22:10.000000000 +0300
+++ pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c	2008-08-26 17:42:56.000000000 +0300
@@ -534,11 +534,11 @@ void write_fontdictionary(fo_entry * fo)
     if (fixedgentounicode > 0 && fo->fd != NULL) {
         if (fo->fe != NULL) {
             fo->tounicode_objnum =
-                write_tounicode(fo->fe->glyph_names, fo->fe->name);
+                write_tounicode(fo->fe->glyph_names, fo->fm->tfm_name, fo->fe->name);
         } else if (is_type1(fo->fm)) {
             assert(fo->fd->builtin_glyph_names != NULL);
             fo->tounicode_objnum =
-                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name);
+                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name, NULL);
         }
     }

...
_______________________________________________
ntg-pdftex mailing list
ntg-pdftex@ntg.nl
http://www.ntg.nl/mailman/listinfo/ntg-pdftex
