[pdftex-Patches][2087] Support for CMap namespaces

19 Dec 2010

      Patches item #2087, was opened at 2008-08-28 08:17
Status: Open
Priority: 3
Submitted By: Vasile Gaburici (vga)
Assigned to: The Thanh Han (hanthethanh)
Summary: Support for CMap namespaces 
Category: Fonts
Group: v1.40.0
Resolution: Accepted

Initial Comment:
There are a couple of LaTeX packages out there (cmap and mmap) that
provide CMaps. They don't work as well as \pdfglyphtounicode: virtual
fonts don't get CMaps at all (the CMap is included in the PDF but not
referenced), and otftotfm-installed fonts lack the CMap entries for
the ligatures that otftotfm sneaks into empty slots. As you know,
\pdfglyphtounicode fixes these problems.

On the other hand, these two packages let the user specify a CMap for
each LaTeX encoding, so the user can give different Unicode values to
the same PS glyph name in different LaTeX encodings. Of course that
works properly only if the fonts invoked by the different LaTeX
encodings are different; otherwise only one can win the \pdffontattr.
A compelling application of this feature is CMaps that set math code
points (usually above the BMP) for TeX math fonts; those glyphs have
exactly the same names as in text fonts (/A etc.). Adding namespaces to
\pdfglyphtounicode makes those two packages obsolete in their current
implementation.

Another advantage of namespaces is the ability to (reliably) fix
TrueType font CMaps. The troublesome glyphs are usually ligatures that
don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm
writes as /indexZZZ in the enc file. Putting those in a per-font
namespace avoids any potential clashes.

So, I've patched pdftex to provide namespaces using the following
syntax extension: the first argument of \pdfglyphtounicode can now
take additional forms:
\pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
\pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}

Since fonts for which the built-in encoding is used happen to be
exactly those that have multiple design sizes (cmr, stmary etc.),
using a separate ps-enc-name for each is not helpful. Instead, the
'enc' namespace for those is obtained by dropping any final digits
from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes
only).
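
A minimal Python sketch of the digit-dropping rule described above. It
is only an illustration, not the patch's actual code, and the helper
name enc_namespace is hypothetical.

```python
import re

def enc_namespace(tex_font_name: str) -> str:
    """Hypothetical helper: drop any final digits from a TeX font name."""
    return re.sub(r"[0-9]+$", "", tex_font_name)

print(enc_namespace("cmr10"))     # cmr
print(enc_namespace("stmary10"))  # stmary
```

All design sizes of a family (cmr5, cmr10, cmr17, ...) then share one
'enc' namespace.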

The search policy is to first search the font namespace, then the
encoding, and finally the global namespace, for which the syntax
remains unchanged. All these namespaces are implemented in the same AVL
tree, just using the above strings as key names. In theory this makes
the search 3 times slower, but that particular phase of pdftex hardly
takes any time, so it seemed premature to implement any optimization.
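
The lookup order above can be sketched in Python as follows. This is
an illustration under assumptions, not the patch itself: a plain dict
stands in for the AVL tree, and the function name lookup_tounicode and
its parameters are hypothetical. The key strings follow the fnt:/enc:
forms proposed in this comment.

```python
def lookup_tounicode(table, glyph, tex_font=None, enc=None):
    """Try the font key, then the encoding key, then the global key."""
    candidates = []
    if tex_font is not None:
        candidates.append(f"fnt:{tex_font}/{glyph}")
    if enc is not None:
        candidates.append(f"enc:{enc}/{glyph}")
    candidates.append(glyph)  # global namespace: bare glyph name
    for key in candidates:
        if key in table:
            return table[key]
    return None  # no mapping in any namespace

# One table holds all three namespaces, distinguished only by key prefix.
table = {
    "A": "0041",
    "enc:cmmi/A": "D835 DC34",
    "fnt:calibly1--base/index415": "0074 0069",
}

print(lookup_tounicode(table, "A", tex_font="cmmi10", enc="cmmi"))
print(lookup_tounicode(table, "A", tex_font="cmr10", enc="cmr"))
```

The first call finds the encoding-level entry, the second falls through
to the global one. Each glyph costs at most three lookups, which is
where the "3 times slower" estimate comes from.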

Some usage examples:

% make the ti ligature searchable in Calibri regular
\pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
% go crazy with Unicode math; TeX math italic gives above-BMP math A
\pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required
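
For readers who want to check the hex values in the second argument,
here is a small Python sketch that derives the UTF-16BE code units for
a code point. The helper name utf16be_hex is hypothetical; it relies
only on Python's standard "utf-16-be" codec.

```python
def utf16be_hex(code_point: int) -> str:
    """Return the UTF-16BE code units of a code point as hex words."""
    data = chr(code_point).encode("utf-16-be")
    # Pair up bytes into 16-bit units, high byte first.
    return " ".join(f"{hi:02X}{lo:02X}" for hi, lo in zip(data[0::2], data[1::2]))

# U+1D434 MATHEMATICAL ITALIC CAPITAL A lies above the BMP,
# so it needs a surrogate pair.
print(utf16be_hex(0x1D434))  # D835 DC34
# The ti ligature maps to two BMP characters, one unit each.
print(utf16be_hex(0x74), utf16be_hex(0x69))  # 0074 0069
```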

Note that search behavior for math letters varies with pdf viewers.
Acrobat implements only canonical equivalence, so you need to enter
the exact code point, and copy/paste preserves the code points, so you
can paste into a LaTeX document if it's using utf8x input encoding.
Evince implements compatibility equivalence, so it's easier to find
those math As by searching for plain A, but they also copy/paste as
plain A. You can use pdftotext however, which uses the same poppler
backend, to have the code points preserved. I'm not really
advocating Unicode math letters, but now they're easily supported in
pdftex -- no need for manual CMaps anymore.
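
The difference between the two kinds of equivalence can be seen with
Unicode normalization in Python. This only illustrates the Unicode
rules; how a particular viewer implements search is not something this
sketch claims to reproduce.

```python
import unicodedata

math_a = "\U0001D434"  # MATHEMATICAL ITALIC CAPITAL A

# Canonical normalization (NFC) leaves the math letter untouched.
nfc = unicodedata.normalize("NFC", math_a)
# Compatibility normalization (NFKC) folds it to plain LATIN CAPITAL A.
nfkc = unicodedata.normalize("NFKC", math_a)

print(nfc == math_a)  # True
print(nfkc == "A")    # True
```

A search that compares compatibility-normalized text therefore matches
plain A against the math letter, while a search based on canonical
equivalence does not.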

BTW, \pdfglyphtounicode now really needs to be documented in the
manual, so people would stop writing (buggy) CMaps by hand. I
volunteer to do it if you accept the patch :)

----------------------------------------------------------------------
...
Comment By: The Thanh Han (hanthethanh)
Date: 2010-12-19 06:45
Message:
patch updated to allow fontfile namespace, as suggested by Eddie Kohler.

----------------------------------------------------------------------

Comment By: The Thanh Han (hanthethanh)
Date: 2010-09-08 08:36

Message:
re-open this case, since the "enc" namespace seems useful and would make glyphtounicode.tex simpler.

----------------------------------------------------------------------

Comment By: The Thanh Han (hanthethanh)
Date: 2009-12-01 22:02

Message:
included in 1.40.10

----------------------------------------------------------------------

Comment By: The Thanh Han (hanthethanh)
Date: 2009-04-07 14:59

Message:
this patch is indeed very useful, however adding tfm namespace support is enough since it can cover the other case. I have changed the patch slightly to:
- drop the enc: case 
- use "tfm:" instead of "fnt:" as the tfm namespace prefix

And added a small test file.

----------------------------------------------------------------------

You can respond by visiting: 
http://sarovar.org/tracker/?func=detail&atid=495&aid=2087&group_id=106

pdftex-patches＠sarovar.org
