[NTG-pdftex] Patch to support CMap namespaces

Thanh Han The hanthethanh at gmail.com
Thu Aug 28 10:26:53 CEST 2008


Can you please submit your patch to
http://sarovar.org/tracker/?atid=495&group_id=106&func=browse

Thanks,
Thanh

On Thu, Aug 28, 2008 at 12:59:51AM +0300, Vasile Gaburici wrote:
> There are a couple of LaTeX packages out there that provide CMaps.
> They don't work as well as \pdfglyphtounicode, i.e. virtual fonts
> don't get CMaps at all (the CMap is included in the PDF but not
> referenced), and otftotfm-installed fonts lack the CMap entries for
> the ligatures that otftotfm sneaks in empty slots. As you know,
> \pdfglyphtounicode fixes these problems.
>
> On the other hand, these two packages let the user specify a CMap for
> each LaTeX encoding, so the user can give different Unicode values to
> the same PS glyph name in different LaTeX encodings. Of course that
> works properly only if the fonts invoked by the different LaTeX
> encodings are different; otherwise only one can win the \pdffontattr.
> A compelling application of this feature is CMaps that set math code
> points (usually above the BMP) for TeX math fonts; those glyphs have
> exactly the same names as in text fonts (/A etc.). Adding namespaces to
> \pdfglyphtounicode makes those two packages obsolete in their current
> implementation.
>
> Another advantage of namespaces is the ability to (reliably) fix
> TrueType font CMaps. The troublesome glyphs are usually ligatures that
> don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm
> writes as /indexZZZ in the enc file. Putting those in a per-font
> namespace avoids any potential clashes.
>
> So, I've patched pdftex to provide namespaces using the following
> syntax extension: the first argument of \pdfglyphtounicode can now
> take additional forms:
> \pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
> \pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}
>
> Since fonts for which the built-in encoding is used happen to be
> exactly those that have multiple design sizes (cmr, stmary etc.),
> using a separate ps-enc-name for each is not helpful. Instead, the
> 'enc' namespace for those is obtained by dropping any final digits
> from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes
> only).
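
The digit-dropping rule described above could be sketched in C roughly as
follows. This is only an illustration: the helper name
enc_namespace_from_font and its buffer-passing interface are hypothetical
and are not taken from the patch, which does the same thing inline in
write_tounicode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: copy a TeX font name into
 * out and drop any final digits, so that "cmr10" becomes "cmr". The first
 * character is always kept, so an all-digit name is never emptied. */
static void enc_namespace_from_font(const char *texname, char *out,
                                    size_t outsize)
{
    size_t len = strlen(texname);

    assert(outsize > 0 && len < outsize);   /* caller supplies enough room */
    memcpy(out, texname, len + 1);          /* copy including the final NUL */
    while (len > 1 && out[len - 1] >= '0' && out[len - 1] <= '9')
        out[--len] = '\0';
}
```

With this rule every design size of a family (cmr5, cmr10, cmr17, ...) ends
up in the same 'enc' namespace.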
>
> The search policy is to first search the font namespace, then the
> encoding namespace, and finally the global namespace, for which the
> syntax remains unchanged. All these namespaces are implemented in the
> same AVL tree, just using the above strings as key names. In theory this
> makes the search up to 3 times slower, but that particular phase of
> pdftex hardly takes any time, so it seemed premature to implement any
> optimization.
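
A minimal sketch of that search order, for readers who do not want to read
the diff. The struct glyph_entry type and the functions find_key and
lookup_glyph are made up for this illustration, and a linear scan over an
array stands in for the AVL tree that pdftex actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One key/value pair as it might be registered by \pdfglyphtounicode. */
struct glyph_entry {
    const char *key;      /* "A", "enc:cmmi/A", "fnt:cmmi10/A", ... */
    const char *unicode;  /* hex string, e.g. "0041" */
};

/* Linear scan standing in for the AVL tree lookup (an assumption made to
 * keep the sketch self-contained). */
static const char *find_key(const struct glyph_entry *tab, size_t n,
                            const char *key)
{
    for (size_t i = 0; i < n; ++i)
        if (strcmp(tab[i].key, key) == 0)
            return tab[i].unicode;
    return NULL;
}

/* Try the font namespace, then the encoding namespace, then the global
 * one. A key that would not fit in the buffer is simply skipped. */
static const char *lookup_glyph(const struct glyph_entry *tab, size_t n,
                                const char *fntname, const char *encname,
                                const char *glyph)
{
    char key[256];
    const char *hit;
    int len;

    len = snprintf(key, sizeof key, "fnt:%s/%s", fntname, glyph);
    assert(len >= 0);
    if ((size_t) len < sizeof key && (hit = find_key(tab, n, key)) != NULL)
        return hit;

    len = snprintf(key, sizeof key, "enc:%s/%s", encname, glyph);
    assert(len >= 0);
    if ((size_t) len < sizeof key && (hit = find_key(tab, n, key)) != NULL)
        return hit;

    return find_key(tab, n, glyph);  /* plain glyph name, global namespace */
}
```

Because all three kinds of key live in one table, a lookup costs at most
three searches, which is where the "3 times slower" estimate comes from.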
>
> Some usage examples:
>
> % make the ti ligature searchable in Calibri regular
> \pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
> % go crazy with Unicode math; TeX math italic gives above-BMP math A
> \pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required
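
For anyone wondering where "D835 DC34" in the second example comes from:
code points above the BMP have to be given as a UTF-16 surrogate pair. The
following is a small sketch of the standard computation; the function name
utf16_units is hypothetical and nothing like it is exposed by pdftex.

```c
#include <assert.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written to units[] (1 or 2). */
static int utf16_units(unsigned long cp, unsigned units[2])
{
    /* must be a valid scalar value: in range and not itself a surrogate */
    assert(cp <= 0x10FFFFUL && (cp < 0xD800UL || cp > 0xDFFFUL));

    if (cp < 0x10000UL) {           /* inside the BMP: one unit, as-is */
        units[0] = (unsigned) cp;
        return 1;
    }
    cp -= 0x10000UL;                /* 20 bits remain */
    units[0] = (unsigned) (0xD800UL + (cp >> 10));     /* high surrogate */
    units[1] = (unsigned) (0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For U+1D434 (MATHEMATICAL ITALIC CAPITAL A) this yields the units 0xD835
and 0xDC34, which are then written as four hex digits each in the second
argument of \pdfglyphtounicode.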
>
> Note that search behavior for math letters varies with PDF viewers.
> Acrobat implements only canonical equivalence, so you need to enter
> the exact code point, and copy/paste preserves the code points, so you
> can paste into a LaTeX document if it's using utf8x input encoding.
> Evince implements compatibility equivalence, so it's easier to find
> those math As by searching for plain A, but they also copy/paste as
> plain A. You can, however, use pdftotext, which uses the same poppler
> backend, to have the code points preserved. I'm not really
> advocating Unicode math letters, but now they're easily supported in
> pdftex -- no need for manual CMaps anymore.
>
> BTW, \pdfglyphtounicode now really needs to be documented in the
> manual, so that people stop writing (buggy) CMaps by hand. I
> volunteer to do it if you accept the patch :)
>
> I also wrote some CMap handling tools, mostly for verification; I'll
> send a separate announcement about that.

> diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h
> --- pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h.nons	2008-08-26 17:44:23.000000000 +0300
> +++ pdftex-1.40.9/src/texk/web2c/pdftexdir/ptexlib.h	2008-08-26 17:45:35.000000000 +0300
> @@ -199,7 +199,7 @@ extern boolean handle_subfont_fm(fm_entr
>  /* tounicode.c */
>  extern void glyph_unicode_free(void);
>  extern void deftounicode(strnumber, strnumber);
> -extern integer write_tounicode(char **, char *);
> +extern integer write_tounicode(char **, const char *, const char *);
>
>  /* utils.c */
>  extern boolean str_eq_cstr(strnumber, char *);
> diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c
> --- pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c.nons	2008-08-26 11:37:18.000000000 +0300
> +++ pdftex-1.40.9/src/texk/web2c/pdftexdir/tounicode.c	2008-08-27 21:41:54.000000000 +0300
> @@ -127,7 +127,7 @@ void deftounicode(strnumber glyph, strnu
>  }
>
>
> -static long check_unicode_value(char *s, boolean multiple_value)
> +static long check_unicode_value(const char *s, boolean multiple_value)
>  {
>      int l = strlen(s);
>      int i;
> @@ -184,12 +184,15 @@ static char *utf16be_str(long code)
>  }
>
>
> -/* this function set proper values to *gp based on s; in case it returns
> +/* this function writes /ToUnicode data to *gp based on glyph name s and
> + * taking into account fntname and encname; in case it returns
>   * gp->code == UNI_EXTRA_STRING then the caller is responsible for freeing
>   * gp->unicode_seq too */
> -static void set_glyph_unicode(char *s, glyph_unicode_entry * gp)
> +static void set_glyph_unicode(const char *s, const char* fntname,
> +                              const char* encname, glyph_unicode_entry *gp)
>  {
>      char buf[SMALL_BUF_SIZE], buf2[SMALL_BUF_SIZE], *p;
> +    const char *p2; /* p2 points in s; p above points in writable copies */
>      long code;
>      boolean last_component;
>      glyph_unicode_entry tmp, *ptmp;
> @@ -223,7 +226,7 @@ static void set_glyph_unicode(char *s, g
>          for (;;) {
>              *p = 0;
>              tmp.code = UNI_UNDEF;
> -            set_glyph_unicode(s, &tmp);
> +            set_glyph_unicode(s, fntname, encname, &tmp);
>              switch (tmp.code) {
>              case UNI_UNDEF:    /* not found, do nothing */
>                  break;
> @@ -256,8 +259,32 @@ static void set_glyph_unicode(char *s, g
>          return;
>      }
>
> -    /* lookup for glyph name in the database */
> -    tmp.name = s;
> +    /* Glyph name search strategy: first look up the glyph name in the
> +       font's namespace, failing that look it up in the PS encoding
> +       namespace, and finally look it up in the main database. */
> +    /* Note: buf may alias s in the code below, but s and buf2 are
> +       guaranteed to be distinct because the code changing buf2 above
> +       always returns before reaching the code below. */
> +    snprintf(buf2, SMALL_BUF_SIZE, "fnt:%s/%s", fntname, s);
> +    tmp.name = buf2;
> +    tmp.code = UNI_UNDEF;
> +    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
> +    if (ptmp != NULL) {
> +        gp->code = ptmp->code;
> +        gp->unicode_seq = ptmp->unicode_seq;
> +        return;
> +    }
> +    snprintf(buf2, SMALL_BUF_SIZE, "enc:%s/%s", encname, s);
> +    tmp.name = buf2;
> +    tmp.code = UNI_UNDEF;
> +    ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
> +    if (ptmp != NULL) {
> +        gp->code = ptmp->code;
> +        gp->unicode_seq = ptmp->unicode_seq;
> +        return;
> +    }
> +    tmp.name = (char *)s; /* this is okay since we're not calling
> +                             destroy_glyph_unicode_entry on this */
>      tmp.code = UNI_UNDEF;
>      ptmp = (glyph_unicode_entry *) avl_find(glyph_unicode_tree, &tmp);
>      if (ptmp != NULL) {
> @@ -268,14 +295,14 @@ static void set_glyph_unicode(char *s, g
>
>      /* check for case of "uniXXXX" (multiple 4-hex-digit values allowed) */
>      if (str_prefix(s, "uni")) {
> -        p = s + strlen("uni");
> -        code = check_unicode_value(p, true);
> +        p2 = s + strlen("uni");
> +        code = check_unicode_value(p2, true);
>          if (code != UNI_UNDEF) {
> -            if (strlen(p) == 4) /* single value */
> +            if (strlen(p2) == 4) /* single value */
>                  gp->code = code;
>              else {              /* multiple value */
>                  gp->code = UNI_EXTRA_STRING;
> -                gp->unicode_seq = xstrdup(p);
> +                gp->unicode_seq = xstrdup(p2);
>              }
>          }
>          return;                 /* since the last case cannot happen */
> @@ -283,8 +310,8 @@ static void set_glyph_unicode(char *s, g
>
>      /* check for case of "uXXXX" (single value up to 6 hex digits) */
>      if (str_prefix(s, "u")) {
> -        p = s + strlen("u");
> -        code = check_unicode_value(p, false);
> +        p2 = s + strlen("u");
> +        code = check_unicode_value(p2, false);
>          if (code != UNI_UNDEF) {
>              assert(code >= 0);
>              gp->code = code;
> @@ -292,7 +319,9 @@ static void set_glyph_unicode(char *s, g
>      }
>  }
>
> -integer write_tounicode(char **glyph_names, char *name)
> +/* tex font name is bare (no .tfm), but enc name is ending in .enc; */
> +integer write_tounicode(char **glyph_names, const char *texname,
> +                        const char* encname)
>  {
>      char buf[SMALL_BUF_SIZE], *p;
>      static char builtin_suffix[] = "-builtin";
> @@ -301,18 +330,24 @@ integer write_tounicode(char **glyph_nam
>      integer objnum;
>      int i, j;
>      int bfchar_count, bfrange_count, subrange_count;
> -    assert(strlen(name) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
> +
>      if (glyph_unicode_tree == NULL) {
>          pdftex_warn("no GlyphToUnicode entry has been inserted yet!");
>          fixedgentounicode = 0;
>          return 0;
>      }
> -    strcpy(buf, name);
> -    if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
> -        *p = 0;                 /* strip ".enc" from encoding name */
> -    else
> -        strcat(buf, builtin_suffix);    /* ".enc" not present, this is a builtin
> -                                           encoding so the name is eg "cmr10-builtin" */
> +    if (encname) {
> +        assert(strlen(encname) < SMALL_BUF_SIZE);
> +        strcpy(buf, encname);
> +        if ((p = strrchr(buf, '.')) != NULL && strcmp(p, ".enc") == 0)
> +            *p = 0;                 /* strip ".enc" from encoding name */
> +        else /* some silly encoding file name not ending in enc; use as-is */
> +            pdftex_warn("Dubious encoding file name: `%s'", encname);
> +    } else { /* this is a builtin encoding, so name is e.g. "cmr10-builtin" */
> +        assert(strlen(texname) + strlen(builtin_suffix) < SMALL_BUF_SIZE);
> +        strcpy(buf, texname);
> +        strcat(buf, builtin_suffix);
> +    }
>      objnum = pdfnewobjnum();
>      pdfbegindict(objnum, 0);
>      pdfbeginstream();
> @@ -336,10 +371,23 @@ integer write_tounicode(char **glyph_nam
>                 "1 begincodespacerange\n"
>                 "<00> <FF>\n" "endcodespacerange\n", buf, buf, buf, buf, buf);
>
> +    /* Fonts with built-in encoding have a unique encoding name so
> +       looking up that encoding does not buy us any grouping. Instead,
> +       we group them by dropping any terminal digits from their name. */
> +    if (!encname) {
> +        strcpy(buf, texname);
> +        for (p = buf + strlen(buf) - 1; p > buf; --p) {
> +            if (*p >= '0' && *p <= '9')
> +                *p = 0;
> +            else
> +                break;
> +        }
> +    }
> +
>      /* set gtab */
>      for (i = 0; i < 256; ++i) {
>          gtab[i].code = UNI_UNDEF;
> -        set_glyph_unicode(glyph_names[i], &gtab[i]);
> +        set_glyph_unicode(glyph_names[i], texname, buf, &gtab[i]);
>      }
>      gtab[256].code = UNI_UNDEF;
>
> diff -up pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c
> --- pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c.nons	2008-08-26 17:22:10.000000000 +0300
> +++ pdftex-1.40.9/src/texk/web2c/pdftexdir/writefont.c	2008-08-26 17:42:56.000000000 +0300
> @@ -534,11 +534,11 @@ void write_fontdictionary(fo_entry * fo)
>      if (fixedgentounicode > 0 && fo->fd != NULL) {
>          if (fo->fe != NULL) {
>              fo->tounicode_objnum =
> -                write_tounicode(fo->fe->glyph_names, fo->fe->name);
> +                write_tounicode(fo->fe->glyph_names, fo->fm->tfm_name, fo->fe->name);
>          } else if (is_type1(fo->fm)) {
>              assert(fo->fd->builtin_glyph_names != NULL);
>              fo->tounicode_objnum =
> -                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name);
> +                write_tounicode(fo->fd->builtin_glyph_names, fo->fm->tfm_name, NULL);
>          }
>      }
>


