[pdftex-Patches][2087] Support for CMap namespaces
Patches item #2087, was opened at 2008-08-28 08:17
Status: Closed
Priority: 3
Submitted By: Vasile Gaburici (vga)
Assigned to: Nobody (None)
Summary: Support for CMap namespaces
Category: Fonts
Group: v1.40.0
Resolution: Accepted
Initial Comment:
There are a couple of LaTeX packages out there (cmap and mmap) that provide CMaps. They don't work as well as \pdfglyphtounicode: virtual fonts don't get CMaps at all (the CMap is included in the PDF but not referenced), and otftotfm-installed fonts lack the CMap entries for the ligatures that otftotfm sneaks into empty slots. As you know, \pdfglyphtounicode fixes these problems.

On the other hand, these two packages let the user specify a CMap for each LaTeX encoding, so the user can give different Unicode values to the same PS glyph name in different LaTeX encodings. Of course that works properly only if the fonts invoked by the different LaTeX encodings are different; otherwise only one can win the \pdffontattr. A compelling application of this feature is CMaps that set math code points (usually above the BMP) for TeX math fonts; those glyphs have exactly the same names as in text fonts (/A etc.). Adding namespaces to \pdfglyphtounicode makes those two packages obsolete in their current implementation.

Another advantage of namespaces is the ability to (reliably) fix TrueType font CMaps. The troublesome glyphs are usually ligatures that don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm writes as /indexZZZ in the enc file. Putting those in a per-font namespace avoids any potential clashes.

So, I've patched pdftex to provide namespaces using the following syntax extension: the first argument of \pdfglyphtounicode can now take two additional forms:

  \pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
  \pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}

Since fonts for which the built-in encoding is used happen to be exactly those that have multiple design sizes (cmr, stmary etc.), using a separate ps-enc-name for each is not helpful. Instead, the 'enc' namespace for those is obtained by dropping any final digits from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes only).
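To make the two key forms concrete, here is a minimal Python sketch (not pdftex's actual code) of how the namespaced key strings could be built, including the "drop any final digits" rule for fonts that use their built-in encoding. The function names `enc_namespace`, `fnt_key` and `enc_key` are hypothetical and exist only for this illustration.

```python
import re


def enc_namespace(font_name, enc_name=None):
    """Return the name used for the 'enc' namespace of a font.

    If the font has an explicit PS encoding name, use it. Otherwise the
    font uses its built-in encoding, and the namespace is the TeX font
    name with any final digits dropped (cmr10 -> cmr).
    """
    if enc_name is not None:
        return enc_name
    return re.sub(r"\d+$", "", font_name)


def fnt_key(font_name, glyph_name):
    """Key string for the per-font namespace."""
    return f"fnt:{font_name}/{glyph_name}"


def enc_key(enc, glyph_name):
    """Key string for the per-encoding namespace."""
    return f"enc:{enc}/{glyph_name}"


print(enc_key(enc_namespace("cmmi10"), "A"))   # enc:cmmi/A
print(fnt_key("calibly1--base", "index415"))   # fnt:calibly1--base/index415
```

With this rule, all design sizes of a font family (cmr5, cmr10, cmr12, ...) share a single 'enc' namespace, so one entry covers them all.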
The search policy is to first search the font namespace, then the encoding namespace, and finally the global namespace, for which the syntax remains unchanged. All these namespaces are implemented in the same avl tree, just using the above strings as key names. In theory this makes the search 3 times slower, but that particular phase of pdftex hardly takes any time, so it seemed premature to implement any optimization.

Some usage examples:

  % make the ti ligature searchable in Calibri regular
  \pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
  % go crazy with Unicode math; TeX math italic gives above-BMP math A
  \pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required

Note that search behavior for math letters varies with pdf viewers. Acrobat implements only canonical equivalence, so you need to enter the exact code point; copy/paste preserves the code points, so you can paste into a LaTeX document if it's using the utf8x input encoding. Evince implements compatibility equivalence, so it's easier to find those math As by searching for plain A, but they also copy/paste as plain A. You can use pdftotext, however, which uses the same poppler backend, to have the code points preserved.

I'm not really advocating Unicode math letters, but now they're easily supported in pdftex -- no need for manual CMaps anymore.

BTW, \pdfglyphtounicode now really needs to be documented in the manual, so people would stop writing (buggy) CMaps by hand. I volunteer to do it if you accept the patch :)

----------------------------------------------------------------------
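The search policy from the initial comment (font namespace, then encoding namespace, then global) can be sketched in a few lines of Python. This is an illustration only, not the patch itself: a plain dict stands in for pdftex's avl tree, and `lookup_tounicode` and `utf16be_hex` are hypothetical helper names. The second helper shows where the `D835 DC34` value for math italic A (U+1D434) comes from.

```python
import re


def lookup_tounicode(table, font_name, glyph_name, enc_name=None):
    """Look up a glyph: font namespace, then encoding, then global.

    `table` maps key strings to ToUnicode values. A dict is used here
    purely for illustration; it is an assumption, not pdftex's data
    structure.
    """
    # Fonts with a built-in encoding: drop final digits (cmr10 -> cmr).
    enc = enc_name if enc_name is not None else re.sub(r"\d+$", "", font_name)
    candidates = (
        f"fnt:{font_name}/{glyph_name}",  # most specific: per-font
        f"enc:{enc}/{glyph_name}",        # per-encoding
        glyph_name,                       # global, unchanged syntax
    )
    for key in candidates:
        if key in table:
            return table[key]
    return None


def utf16be_hex(code_point):
    """Format a code point as space-separated UTF-16BE code units."""
    data = chr(code_point).encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return " ".join(f"{unit:04X}" for unit in units)


table = {
    "A": "0041",
    "enc:cmmi/A": utf16be_hex(0x1D434),  # MATHEMATICAL ITALIC CAPITAL A
    "fnt:calibly1--base/index415": "0074 0069",
}

print(lookup_tounicode(table, "cmmi10", "A"))  # D835 DC34
print(lookup_tounicode(table, "cmr10", "A"))   # 0041
```

Each glyph costs at most three key lookups, which is the "3 times slower in theory" trade-off mentioned in the comment.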
Comment By: The Thanh Han (hanthethanh) Date: 2009-12-01 22:02
Message:
included in 1.40.10

----------------------------------------------------------------------

Comment By: The Thanh Han (hanthethanh) Date: 2009-04-07 14:59

Message:
this patch is indeed very useful, however adding tfm namespace support is enough since it can cover the other case. I have changed the patch slightly to:

- drop the enc: case
- use "tfm:" instead of "fnt:" as the tfm namespace prefix

And added a small test file.

----------------------------------------------------------------------

You can respond by visiting: http://sarovar.org/tracker/?func=detail&atid=495&aid=2087&group_id=106