Thomas A. Schmitz wrote:
For your general problem you need to define a new regime that maps each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of encoding another set of characters, and that it can forget the original stream. This translation should happen before any font property comes into play, because it does not depend on the appearance of the typeset text. That's what regimes are for.
regimes are a solution, but which solution is best depends on the input stream ... whole document? partial document? also written to external files? eventually everything can become unicode (private areas) and as such travel through the system; or we can misuse virtual fonts ...
we could plug into the input stream reading routine (just like other regimes work).
there are mechanisms for that (because that's what i played a lot with last year); there was (maybe even is) a mechanism for chained processing of input etc
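to make that concrete, a minimal sketch of such an input-stream hook, in bare luatex terms (context traps callbacks and registers them through its own interface, so real mkiv code would look different; the mapping table and function name here are just made up for illustration):

-- map babel-style greek input sequences to unicode at read time
local mapping = {
  ["<a"] = utf8.char(0x1F01), -- alpha with dasia
  [">a"] = utf8.char(0x1F00), -- alpha with psili
}

local function babel_to_unicode(line)
  -- replace each two-byte babel sequence by its unicode character;
  -- a real regime would handle longer sequences (accents, iota) first
  return (line:gsub("[<>]a", mapping))
end

-- utf8.char assumes a lua 5.3 luatex; older ones had unicode.utf8.char
callback.register("process_input_buffer", babel_to_unicode)

once the buffer is rewritten this way, the rest of the system only ever sees real unicode characters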
That approach would actually tell ConTeXt that you are handling Latin characters with a special appearance (which the font takes care of); so, for example, the underlying text in a PDF would be a stream of Latin characters, and copying-and-pasting would yield Latin characters, not Greek.
not entirely true ... we can (and do) intercept the node stream ... ok, at that point we're dealing with a font/char pair, but we can change the char (or node) to whatever we like ... depends on the problem
The question of copy-and-paste is one of the big mysteries, and I have no clue why it works in some cases but not in others. Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly whether I use babel or Unicode input. Never touch a running system: I just take this as some sort of divine favor and leave it at that...
that's a matter of associating tounicode entries, of course; no unicode means no copy/paste :-)
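roughly, on the luatex side a character entry in a font table can carry a tounicode field (a string of utf-16be hex digits) that ends up in the pdf's /ToUnicode cmap; the helper below is hypothetical, it just shows the shape of the data:

local function patch_tounicode(fontdata, slot, unicode)
  -- this sketch only handles bmp characters (no surrogate pairs)
  local chr = fontdata.characters and fontdata.characters[slot]
  if chr then
    chr.tounicode = string.format("%04X", unicode)
  end
end

-- e.g. make a private-area slot copy/paste as U+1F01:
-- patch_tounicode(fontdata, 0xE001, 0x1F01)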
That is not what you want here: you want your "a" to be understood as "alpha" and your "less-than grave-sign w vertical-bar" to be considered an "omega with dasia, varia and subscribed iota". Nor should you think of these transformations as a collection of ligatures (which act at the font level), but rather as a text encoding, just like UTF-8 is an encoding of the Unicode characters: in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 81" is the coding for the Unicode character U+1F01 GREEK SMALL LETTER ALPHA WITH DASIA, and in the Babel input scheme for Ancient Greek the same character is encoded with the byte sequence "hexadecimal byte 3C [ASCII '<'], hexadecimal byte 61 [ASCII 'a']".
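(For the record, the UTF-8 arithmetic is easy to check from Lua; utf8.char assumes a Lua 5.3 interpreter:

-- the three bytes that encode U+1F01 in utf-8
local s = utf8.char(0x1F01)
print(string.format("%02X %02X %02X", s:byte(1, -1)))
--> E1 BC 81

so the byte sequence really is just another spelling of the same character.)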
Yes, that's crystal clear. It would also take care of another problem: in the input stream, you know exactly which character sequence translates to what. On the font level, legacy fonts sometimes have their own ideas about where to put certain glyphs.
depends ... the input char becomes a node; now, if (probably controlled by attributes) a certain char is seen (say 'a') and you want it to be an alpha, well, we can change that char then in the node
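in code, that node-level change could look roughly like this (bare luatex flavour again; the attribute number and the table are invented for the example, context has its own attribute allocation and node-processing hooks):

local GLYPH = node.id("glyph")
local greek_attribute = 9999 -- hypothetical; really you'd allocate one
local to_greek = { [string.byte("a")] = 0x03B1 } -- 'a' -> alpha

local function greekify(head)
  for n in node.traverse_id(GLYPH, head) do
    -- only touch glyphs that carry the marker attribute
    if node.has_attribute(n, greek_attribute) then
      local replacement = to_greek[n.char]
      if replacement then
        n.char = replacement
      end
    end
  end
  return head
end

callback.register("pre_linebreak_filter", greekify)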
Of course in the past, these transformations were handled at the font level and sequences like "<a" were actually ligatures, because that was all we had (and copy-pasting from a PDF was, mostly, doomed to fail); but we should not persist in that use now that we can treat them as real Unicode characters.
those hard-coded mechanisms were indeed not sufficient

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------