Hi Arthur,
First of all: thank you so much for your time and your expertise! Your reply and your scripts really make things a lot clearer for me; this is a huge step forward! I'll have to experiment and think more about it; here are just a few reactions to some of your remarks:
On Sep 13, 2007, at 3:15 AM, Arthur Reutenauer wrote:
Hello Thomas,
I was waiting for someone else to answer your questions because I had no clue how to address them even if I was interested; but now I do, thanks to Hans' reply:
For your general problem you need to define a new regime that will map each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of coding another set of characters and that it can forget the original stream. This treatment should be done before any sort of font property intervenes, because it does not depend on the appearance of the typeset text. That's what regimes are for.
I agree that this would probably be the cleanest solution: since LuaTeX has Unicode support, map everything to the corresponding Unicode characters. This would also make hyphenation easier to achieve.
Now I turn to Hans to give us guidelines on how to define an advanced regime in Mark IV: Hans, what we need here is to replace sequences of characters by other characters, so the mapping is not one-to-one and it's more complicated than simple regimes defined by a table lookup; but I guess all we have to do is write a Lua function that we could plug into the input stream reading routine (just like other regimes work).
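To make the "sequences of characters replaced by other characters" idea concrete, here is a minimal sketch of a greedy longest-match transcoder. It is written in Python purely for illustration; it is not ConTeXt's regime mechanism, and the names `BABEL_TO_UNICODE` and `transcode` as well as the four table entries are assumptions chosen for the example, not a complete Babel table.

```python
# Hypothetical, deliberately tiny table: Babel-style ASCII sequences mapped
# to Unicode characters. A real regime would need every letter and every
# diacritic combination.
BABEL_TO_UNICODE = {
    "a": "\u03b1",    # GREEK SMALL LETTER ALPHA
    "w": "\u03c9",    # GREEK SMALL LETTER OMEGA
    "<w": "\u1f61",   # GREEK SMALL LETTER OMEGA WITH DASIA
    "<w|": "\u1fa1",  # GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
}


def transcode(text, table=BABEL_TO_UNICODE):
    """Replace the longest matching input sequence at each position.

    Characters that start no known sequence are copied through unchanged.
    """
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "<w|" wins over "<w" and "w".
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No sequence starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transcode("<w| a"))
```

Trying the longest sequence first is what makes the mapping many-to-one rather than a per-character table lookup: the three input characters "<w|" become a single Unicode character before any font is involved.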
As far as the rest of Hans' reply is concerned (OpenType features and such), I would like to add that it is a very interesting and fascinating thing to do, but definitely not what you want here, for a lot of reasons: OpenType features can be used to alter the appearance of the text, but not the nature of characters themselves. That is, if you did the transformation of your input stream at the font level, you would actually tell ConTeXt that you are handling Latin characters with a special appearance (that the font takes care of), so for example, the underlying text in a PDF would be a stream of Latin characters, and copying-and-pasting would yield Latin characters, not Greek.
The question of copy-and-paste is one of the big mysteries, and I have no clue why it works in some cases, but not in others. Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly no matter if I use babel or Unicode input. Never touch a running system: I just take this as some sort of divine favor and leave it at that...
That is not what you want here: you want your "a" to be understood as "alpha" and your "less-than grave-accent w vertical-bar" to be considered an "omega with dasia, varia and subscribed iota". Nor should you think of these transformations as a collection of ligatures (which act at the font level), but rather as a text encoding, just like UTF-8 is an encoding of the Unicode characters: in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, and in the Babel input scheme for Ancient Greek the same character is encoded with the byte sequence "hexadecimal byte 3E [ASCII '>'], hexadecimal byte 61 [ASCII 'a']".
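The UTF-8 half of this comparison is easy to check. The following is a small Python sketch (the variable names are only illustrative) that encodes U+1F00 and decodes the three bytes back into one character, which is the same kind of "several input units, one character" mapping that an input regime performs.

```python
import unicodedata

alpha_psili = "\u1f00"

# The official Unicode name of the character.
print(unicodedata.name(alpha_psili))

# Its UTF-8 encoding, shown as hexadecimal bytes.
utf8_bytes = alpha_psili.encode("utf-8")
hex_bytes = " ".join(f"{byte:02X}" for byte in utf8_bytes)
print(hex_bytes)  # E1 BC 80

# Decoding turns the three bytes back into exactly one character.
decoded = bytes([0xE1, 0xBC, 0x80]).decode("utf-8")
print(len(decoded), decoded == alpha_psili)
```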
Yes, that's crystal clear. It would also take care of another problem: in the input stream, you know exactly which character sequence translates to what. On the font level, legacy fonts sometimes have their own ideas about where to put certain glyphs.
Of course in the past, these transformations were handled at the font level and sequences like "< a" were actually ligatures, because that was all we had (and copy-pasting from a PDF was, mostly, doomed to fail); but we should not persist in that use now that we can treat them as real Unicode characters.
Well yes, but see above.
As for your other question in your original message from September 1st (remapping single characters, for example U+03C3 to U+03F2), I have to say first that I'm not very comfortable commenting on it since I'm not quite sure what the issues are here; it may be that you have a simple variant of some character, and this you should handle at font level (some glyph being transformed into some other one); but if I am to judge by the very example you gave, I would deem this should be a part of your input regime: indeed, if every sigma is to be mapped to lunate sigma, then it probably means that the lunate sigmas are part of your character stream (even if you didn't input them directly). But I really can't give any general advice here, especially because I don't actually know what a lunate sigma really is ;-) You would have to decide for yourself as a specialist of Greek if you're dealing with really different characters or simple font variants; in the former case you should handle the transformation as a part of your regime; in the latter, by defining a font feature like Hans demonstrated.
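For the "part of your regime" option, the single-character remapping could look roughly like the Python sketch below. The U+03C3 to U+03F2 pair comes from the original question; also sending final sigma (U+03C2) to the lunate form is an assumption made for this example, and `LUNATE_MAP` and `to_lunate` are hypothetical names.

```python
# Character-level remapping: the output stream really contains U+03F2,
# so copy-and-paste from the result would yield lunate sigmas.
LUNATE_MAP = str.maketrans({
    "\u03c3": "\u03f2",  # GREEK SMALL LETTER SIGMA -> GREEK LUNATE SIGMA SYMBOL
    "\u03c2": "\u03f2",  # GREEK SMALL LETTER FINAL SIGMA (assumed to map too)
})


def to_lunate(text):
    """Return text with every small sigma replaced by the lunate sigma."""
    return text.translate(LUNATE_MAP)


word = "\u03c3\u03bf\u03c6\u03cc\u03c2"  # "sophos", with initial and final sigma
print(to_lunate(word))
```

The font-level alternative would leave U+03C3 in the character stream and only swap the glyph, so which of the two is right depends on whether the lunate sigma is treated as a different character or as a variant shape.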
I guess that different sorts of users would respond differently. In Unicode, there's a different slot for some alternate characters, so the Unicode standard really considers them different characters. For the classicist, a sigma is a sigma, and the fact that it can be rendered as a "lunate" or a "normal" sigma is irrelevant. For me, this makes more sense, so I would support this on the font level.
But for now, as long as it is understood that font tricks aren't the general solution for the problem at stake, I would like to demonstrate that it is still possible to do everything at font level :-)
If you have a look at the attached greek-babel.tex (and the features definition file greek-babel.fea) you will see that (almost) everything is taken care of using OpenType substitutions. You need Bosporos and GFS Baskerville to compile the file; by the way, the line with GFS Baskerville is a further proof that you shouldn't handle the transformation at font level: can you explain why it doesn't work here? As a complement, I also attach the Perl script which I wrote to generate the .fea file.
Wonderful! I will look carefully at these files. I've been playing with Perl and Python all day yesterday for another problem, so I'm very much looking forward to studying your script.
Thanks so much, and all best
Thomas