Please tell me this isn't in a FAQ. :) Is there support for ActualText tags so that searching and extraction will work with OpenType fonts and Unicode? If so, do discretionary hyphens get treated as 00AD instead of 002D?
Barry Schwartz wrote:
Please tell me this isn't in a FAQ. :) Is there support for ActualText tags so that searching and extraction will work with OpenType fonts and Unicode? If so, do discretionary hyphens get treated as 00AD instead of 002D?
can you explain in more detail what you mean with 'actual text tags'?

concerning searching ... tounicode vectors are added (including heuristics for ligatures and such) so searching, cut/paste etc should work ok

in order to see a problem with hyphens i need an example font/text combination that i can generate on my machine

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
can you explain in more detail what you mean with 'actual text tags'?
He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623. It's a more generic way to support searching than ToUnicode vectors: you just specify the actual string of underlying Unicode characters. The PDF spec uses hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to search for "Drucker". You can't do that with ToUnicode vectors.

Anyway, this needs support at the engine level and I don't think there is; actually it would be nice to add that to LuaTeX.

Arthur
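To make the mechanism concrete, here is a minimal Python sketch of what such a marked-content span looks like in a page's content stream. The helper names `actual_text_hex` and `wrap_span` are made up for illustration; the only things assumed from PDF itself are the `/Span ... BDC` / `EMC` operators, the `/ActualText` key, and the convention that a Unicode text string is UTF-16BE with a leading FEFF byte-order mark.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by a byte-order mark."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def wrap_span(content_ops: str, actual_text: str) -> str:
    """Wrap content-stream operators in a /Span marked-content sequence
    whose replacement text is `actual_text` (hypothetical helper)."""
    return (
        f"/Span << /ActualText {actual_text_hex(actual_text)} >> BDC\n"
        f"{content_ops}\n"
        "EMC"
    )


if __name__ == "__main__":
    # The page shows "Druk-" and "ker"; the span tells extractors that the
    # visible "k-" stands for the single letter "c", giving "Drucker".
    stream = "(Dru) Tj\n" + wrap_span("(k-) Tj", "c") + "\n(ker) '"
    print(stream)
```

The same wrapper with `actual_text="\u00ad"` around a line-end hyphen would mark it as a soft hyphen rather than 002D.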
On 19.09.2009 at 19:10, Arthur Reutenauer wrote:
Anyway, this needs support at the engine level and I don't think there is; actually it would be nice to add that to LuaTeX.
Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX, so why shouldn't it be possible to use it in LuaTeX (and ConTeXt)?

Wolfgang
Heiko Oberdiek wrote the accsupp package to use ActualText in LaTeX, so why shouldn't it be possible to use it in LuaTeX (and ConTeXt)?
Right, you don't need additional engine support, you can use \pdfliteral in pdfTeX, and in LuaTeX as well. Heiko's package should be quite easy to port to ConTeXt.

Arthur
Arthur Reutenauer wrote:
He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623. It's a more generic way to support searching than ToUnicode vectors: you just specify the actual string of underlying Unicode characters. The PDF spec uses hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to search for "Drucker". You can't do that with ToUnicode vectors.
You also need ActualText tags to mark the difference between a discretionary hyphen and an explicit hyphen in English, which programs like Reader use when extracting text. When the hyphen is discretionary you set the ActualText to U+00AD instead of U+002D. (That's mentioned somewhere in the PDF spec.)

Another thing I just thought of that isn't always done is that there should be explicit space characters between words, including at the ends of lines, although I'm not sure whether Adobe Reader turns off its word-boundary heuristics if it sees space characters.

Since what I enjoy doing is making e-books that can be searched through and, perhaps more importantly, extracted from via the Select tool, it's important to me to make the search, selection, and extraction features work. I'll use them myself if I choose, for instance, to quote from an e-book I made. I've added them in my (heavily) modified version of ant, but that's in a primitive state, a long-term project that competes with font-making and e-book-making for time, and so I'd like to have ConTeXt as well. I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th" ligature discussion that searching and extraction didn't work with small caps, though it did work with the ligature. With ActualText tags these things always work, regardless of the ToUnicode map's contents.

The way Cairo's PDF backend handles this is to use an ActualText tag for any glyphs that aren't included in the font's encoding. What I did in my modified ant is to generate a ToUnicode map from the Adobe glyph naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) and then put an ActualText tag on anything that happens not to match what you would get from the ToUnicode mapping.
(For reasons that were stupid, I once created a lame little C library to do the mapping from glyph names to Unicode, using a compressed lookup trie: http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-librari... )
Barry Schwartz wrote:
Also, I noticed when playing around with the examples from the "Th" ligature discussion that searching and extraction didn't work with small caps, though it did work with the ligature. With ActualText tags
hm, mkiv has an analyser for names->unicode and afaik small caps should work, unless the glyph name cannot be interpreted (as i don't have the font i cannot see what happens or what goes wrong here)
these things always work, regardless of the ToUnicode map's contents. The way Cairo's PDF backend handles this is to use an ActualText tag for any glyphs that aren't included in the font's encoding. What I did in my modified ant is to generate a ToUnicode map from the Adobe glyph naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) and then
thanks for the pointer
put an ActualText tag on anything that happens not to match what you would get from the ToUnicode mapping.
hm, if one knows the character (say c) then why not adapt the tounicode vector

Hans
Hans Hagen wrote:
put an ActualText tag on anything that happens not to match what you would get from the ToUnicode mapping.
hm, if one knows the character (say c) then why not adapt the tounicode vector
The same glyph could correspond to different Unicode in the source. This is exactly what happens normally with hyphens. In practice what I see with my method is that discretionary hyphens always get an ActualText, and if the font is older and has names like "Asmall" or "ffl" (which I don't bother handling specially) then the substituted stuff gets an ActualText. I could look at the font's internal encoding the way I think Cairo does, but it doesn't matter a whole lot.
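A rough Python sketch of that rule, as I understand it from the description above: a glyph gets an ActualText only when the source text differs from what the font-wide ToUnicode map would yield. The table `TO_UNICODE` and the function `actual_text_for` are hypothetical names for illustration, not taken from ant or Cairo.

```python
from typing import Optional

# Hypothetical font-wide ToUnicode table: one fixed string per glyph name.
TO_UNICODE = {"hyphen": "\u002d", "a": "a"}


def actual_text_for(glyph_name: str, source_text: str) -> Optional[str]:
    """Return the ActualText string a glyph needs, or None when the
    ToUnicode map already reproduces the source text."""
    mapped = TO_UNICODE.get(glyph_name)  # None if the glyph is unmapped
    return None if mapped == source_text else source_text


if __name__ == "__main__":
    # An explicit hyphen matches the map, so no tag is needed.
    print(actual_text_for("hyphen", "-"))
    # A discretionary hyphen uses the same glyph but came from U+00AD.
    print(repr(actual_text_for("hyphen", "\u00ad")))
    # A glyph missing from the map always gets its source text.
    print(actual_text_for("ffh", "ffh"))
```

Because the ToUnicode map holds exactly one value per glyph, it cannot express the hyphen case on its own; the per-occurrence tag carries the difference.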
Barry Schwartz wrote:
In practice what I see with my method is that discretionary hyphens always get an ActualText, and if the font is older and has names like "Asmall" or "ffl" (which I don't bother handling specially) then the substituted stuff gets an ActualText. I could look at the font's internal encoding the way I think Cairo does, but it doesn't matter a whole lot.
Oops, "ffl" is in the Adobe Glyph List and so would get put into the ToUnicode. Something like "ffh" wouldn't, however, but "f_f_h" would because it can be broken down into parts that are in the AGL.
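The decomposition can be sketched in a few lines of Python, following the rules of the Adobe glyph naming convention as I recall them: drop any suffix after the first period, split on underscores, then look each component up in the AGL or parse it as `uniXXXX` / `uXXXX`. `AGL_EXCERPT` is a tiny stand-in for the real list, and the function names are made up for this example.

```python
import re

# Tiny excerpt standing in for the Adobe Glyph List (the real list is far larger).
AGL_EXCERPT = {"f": "f", "h": "h", "ffl": "\ufb04", "hyphen": "-"}


def _is_scalar(cp: int) -> bool:
    """True for code points that are valid Unicode scalar values."""
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def component_to_unicode(component: str) -> str:
    """Map one underscore-separated component to a string ('' if unknown)."""
    if component in AGL_EXCERPT:
        return AGL_EXCERPT[component]
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_is_scalar, cps)) else ""
    m = re.fullmatch(r"u([0-9A-F]{4,6})", component)
    if m and _is_scalar(int(m.group(1), 16)):
        return chr(int(m.group(1), 16))
    return ""


def glyph_name_to_unicode(name: str) -> str:
    """Drop any '.suffix', split on '_', and concatenate the mapped parts."""
    base = name.split(".", 1)[0]
    return "".join(component_to_unicode(c) for c in base.split("_"))


if __name__ == "__main__":
    for name in ("ffl", "ffh", "f_f_h", "uni00AD"):
        print(name, repr(glyph_name_to_unicode(name)))
```

With this excerpt, "ffh" maps to the empty string while "f_f_h" maps to "ffh", which is the distinction described above.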
Arthur Reutenauer wrote:
can you explain in more detail what you mean with 'actual text tags'?
He means "ActualText tags" :-) See the PDF spec section 14.9.4, page 623. It's a more generic way to support searching than ToUnicode vectors: you just specify the actual string of underlying Unicode characters. The PDF spec uses hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to search for "Drucker". You can't do that with ToUnicode vectors.
Anyway, this needs support at the engine level and I don't think there is; actually it would be nice to add that to LuaTeX.
hm, if done with words it's probably doable with an unadapted engine (esp when we have a cleaner pdfliteral model, which is on the agenda)

    \starttext
    \dorecurse{100}{test }
    \pdfliteral{/Span <> BDC}arthur\pdfliteral{EMC}
    \dorecurse{100}{test }
    \stoptext

not that hard to implement if we add a span around each word

Hans
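The "span around each word" idea can be sketched as a small Python preprocessor that emits the TeX markup. This is only an illustration under assumptions: it assumes the `\pdfliteral` primitive behaves as in the snippet above, that the span's property list is meant to carry an `/ActualText` entry (the archived snippet shows only `<>` there), and that the words contain no TeX special characters. The function names are hypothetical.

```python
def hex_utf16(text: str) -> str:
    """PDF hex string for text: UTF-16BE with a leading byte-order mark."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def span_word(word: str) -> str:
    """Wrap one word in BDC/EMC literals carrying its own text as ActualText."""
    return (
        "\\pdfliteral{/Span << /ActualText " + hex_utf16(word) + " >> BDC}"
        + word
        + "\\pdfliteral{EMC}"
    )


def span_words(text: str) -> str:
    """Apply span_word to every whitespace-separated word in text."""
    return " ".join(span_word(w) for w in text.split())


if __name__ == "__main__":
    print(span_words("test arthur test"))
```

A word hyphenated across lines would need more care than this, since the literals here bracket the word as a unit.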