Re: [NTG-context] Greek in luatex

13 Sep 2007

Hi Arthur,

first of all: thank you so much for your time and your expertise!
Your reply and your scripts really make things a lot clearer for me;
this is a huge step forward! I'll have to experiment and think more
about it; here are just a few reactions to some of your remarks:

On Sep 13, 2007, at 3:15 AM, Arthur Reutenauer wrote:
...
> Hello Thomas,
>
> I was waiting for someone else to answer your questions because I
> had no clue how to address them even if I was interested; but now I
> do, thanks to Hans' reply:
>
> For your general problem you need to define a new regime that will
> map each relevant character sequence to the corresponding Unicode
> character.  That is, you inform ConTeXt that the character stream it
> sees is actually a way of coding another set of characters and that
> it can forget the original stream.  This treatment should be done
> before any sort of font property intervenes, because it does not
> depend on the appearance of the typeset text.  That's what regimes
> are for.

I agree that this would probably be the cleanest solution: since
luatex has Unicode support, map everything to the corresponding
Unicode characters. This would also make hyphenation easier to achieve.
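
To make the idea concrete for myself, here is a minimal sketch in
Python (not ConTeXt or Lua code, and not an actual regime) of mapping
character sequences to Unicode characters before any font is involved.
The ASCII conventions in the tables are my assumption of the usual
Babel scheme (">" psili, "<" dasia, "'" oxia, "`" varia, "~"
perispomeni, "|" iota subscript), the letter table is only a small
subset, and the function name babel_to_unicode is made up for this
illustration.

```python
import unicodedata

# Assumed Babel-style ASCII conventions; illustrative subset only.
LETTERS = {"a": "\u03b1", "e": "\u03b5", "h": "\u03b7", "i": "\u03b9",
           "o": "\u03bf", "u": "\u03c5", "w": "\u03c9"}
PREFIX_MARKS = {">": "\u0313",   # psili (smooth breathing)
                "<": "\u0314",   # dasia (rough breathing)
                "'": "\u0301",   # oxia (acute)
                "`": "\u0300",   # varia (grave)
                "~": "\u0342"}   # perispomeni
IOTA_SUBSCRIPT = "\u0345"        # written as "|" after the letter

def babel_to_unicode(text):
    """Map ASCII sequences to Greek letters plus combining marks,
    then let NFC normalization build the precomposed characters."""
    out = []
    pending = ""                 # ASCII diacritics waiting for a letter
    for ch in text:
        if ch in PREFIX_MARKS:
            pending += ch
        elif ch in LETTERS:
            marks = "".join(PREFIX_MARKS[m] for m in pending)
            out.append(LETTERS[ch] + marks)
            pending = ""
        elif ch == "|" and out and out[-1][0] in LETTERS.values():
            out[-1] += IOTA_SUBSCRIPT
        else:
            # Not part of the scheme: pass everything through unchanged.
            out.append(pending + ch)
            pending = ""
    out.append(pending)
    return unicodedata.normalize("NFC", "".join(out))

print(babel_to_unicode(">a"))    # alpha with psili, U+1F00
print(babel_to_unicode("<`w|"))  # omega with dasia, varia, iota subscript
```

The point of the sketch is only that the mapping is many-to-one and
happens on the character stream; how to hook something like this into
the input reading routine is exactly the open question for Hans.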
...
> Now I turn to Hans to give us guidelines on how to define an
> advanced regime in Mark IV: Hans, what we need here is to replace
> sequences of characters by other characters, so the mapping is not
> one-to-one and it's more complicated than simple regimes defined by
> a table lookup; but I guess all we have to do is write a Lua
> function that we could plug into the input stream reading routine
> (just like other regimes work).
>
> As far as the rest of Hans' reply is concerned (OpenType features
> and such), I would like to add that it is a very interesting and
> fascinating thing to do, but definitely not what you want here, for
> a lot of reasons: OpenType features can be used to alter the
> appearance of the text, but not the nature of the characters
> themselves.  That is, if you did the transformation of your input
> stream at the font level, you would actually tell ConTeXt that you
> are handling Latin characters with a special appearance (that the
> font takes care of), so for example, the underlying text in a PDF
> would be a stream of Latin characters, and copying-and-pasting would
> yield Latin characters, not Greek.

The question of copy-and-paste is one of the big mysteries, and I
have no clue why it works in some cases, but not in others. Right
now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste
correctly, and it does it correctly no matter if I use babel or
Unicode input. Never touch a running system: I just take this as
some sort of divine favor and leave it at that...
...
> That is not what you want here: you want your "a" to be understood
> as "alpha" and your "less-than grave-accent w vertical-bar" to be
> considered an "omega with dasia, varia and subscribed iota".  Nor
> should you think of these transformations as a collection of
> ligatures (which act at the font level), but rather as a text
> encoding, just like UTF-8 is an encoding of the Unicode characters:
> in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte
> BC, hexadecimal byte 80" is the coding for the Unicode character
> U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, and in the Babel input
> scheme for Ancient Greek the same character is encoded with the byte
> sequence "hexadecimal byte 3E [ASCII '>'], hexadecimal byte 61
> [ASCII 'a']".

Yes, that's crystal clear. It would also take care of another
problem: in the input stream, you know exactly which character
sequence translates to what. On the font level, legacy fonts
sometimes have their own ideas about where to put certain glyphs.
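
The UTF-8 half of that comparison is easy to check with a few lines of
Python (just a sanity check of the byte values quoted above, nothing
ConTeXt-specific; the variable names are mine):

```python
import unicodedata

alpha_psili = "\u1f00"
utf8_bytes = alpha_psili.encode("utf-8")

# Official Unicode name of the character
print(unicodedata.name(alpha_psili))
# → GREEK SMALL LETTER ALPHA WITH PSILI

# The three bytes of its UTF-8 encoding, in hexadecimal
print(" ".join(f"{b:02X}" for b in utf8_bytes))
# → E1 BC 80

# Decoding the same bytes gives the character back
assert bytes([0xE1, 0xBC, 0x80]).decode("utf-8") == alpha_psili
```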
...
> Of course in the past, these transformations were handled at the
> font level and sequences like "< a" were actually ligatures, because
> that was all we had (and copy-pasting from a PDF was, mostly, doomed
> to fail); but we should not persist in that use now that we can
> treat them as real Unicode characters.

Well yes, but see above.
...
> As for your other question in your original message from September
> 1st (remapping single characters, for example U+03C3 to U+03F2), I
> have to say first that I'm not very comfortable commenting on it
> since I'm not quite sure what the issues are here; it may be that
> you have a simple variant of some character, and this you should
> handle at font level (some glyph being transformed into some other
> one); but if I am to judge by the very example you gave, I would
> deem this should be a part of your input regime: indeed, if every
> sigma is to be mapped to lunate sigma, then it probably means that
> the lunate sigmas are part of your character stream (even if you
> didn't input it directly).  But I really can't give any general
> advice here, especially because I don't actually know what a lunate
> sigma really is ;-)  You would have to decide for yourself as a
> specialist of Greek if you're dealing with really different
> characters or simple font variants; in the former case you should
> handle the transformation as a part of your regime; in the latter,
> by defining a font feature like Hans demonstrated.

I guess that different sorts of users would respond differently. In
Unicode, there's a different slot for some alternate characters, so
the Unicode standard really considers them different characters. For
the classicist, a sigma is a sigma, and the fact that it can be
rendered as a "lunate" or a "normal" sigma is irrelevant. For me,
this makes more sense, so I would support this on the font level.
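
For comparison, this is roughly what the other option (remapping in
the character stream) would amount to, sketched in Python rather than
in anything ConTeXt actually provides. Only the U+03C3 to U+03F2
mapping comes from my original question; treating final sigma (U+03C2)
the same way is an extra assumption of mine, and the name to_lunate is
invented for the example.

```python
# One-to-one remapping at the character level.
LUNATE_TABLE = str.maketrans({
    "\u03c3": "\u03f2",   # GREEK SMALL LETTER SIGMA -> lunate sigma
    "\u03c2": "\u03f2",   # final sigma -> lunate sigma (my assumption)
})

def to_lunate(text):
    """Replace every small sigma by the lunate sigma character."""
    return text.translate(LUNATE_TABLE)

# sigma omicron phi omicron-with-tonos final-sigma
word = "\u03c3\u03bf\u03c6\u03cc\u03c2"
print(to_lunate(word))
```

After such a remapping the text itself (and whatever ends up in the
PDF) contains U+03F2 and no longer U+03C3, which is precisely why I
would rather leave the characters alone and change only the glyph.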
...
> But for now, as long as it is understood that font tricks aren't the
> general solution for the problem at stake, I would like to
> demonstrate that it is still possible to do everything at font
> level :-)
>
> If you have a look at the attached greek-babel.tex (and the features
> definition file greek-babel.fea) you will see that (almost)
> everything is taken care of using OpenType substitutions.  You need
> Bosporos and GFS Baskerville to compile the file; by the way, the
> line with GFS Baskerville is a further proof that you shouldn't
> handle the transformation at font level: can you explain why it
> doesn't work here?
>
> As a complement, I also attach the Perl script which I wrote to
> generate the .fea file.

Wonderful! I will look carefully at these files. I've been playing
with Perl and Python all day yesterday for another problem, so I'm
very much looking forward to studying your script.
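
Before I have studied your script, here is my guess, in Python, at what
generating such a feature file could look like. This is not your Perl
script and not the contents of greek-babel.fea: the "liga" feature
tag, the uniXXXX glyph names, the two sample sequences and the
function names are all assumptions made up for illustration, and I
only emit ligature-type rules ("sub A B by C;").

```python
# Glyph names for the ASCII characters used as diacritics (assumed to
# follow the usual Adobe glyph naming; plain letters keep their name).
ASCII_GLYPHS = {">": "greater", "<": "less", "`": "grave", "|": "bar"}

def fea_rule(ascii_sequence, target):
    """One ligature substitution rule for a sequence of ASCII glyphs."""
    names = [ASCII_GLYPHS.get(c, c) for c in ascii_sequence]
    return f"    sub {' '.join(names)} by uni{ord(target):04X};"

def fea_feature(tag, pairs):
    """Wrap the rules in a feature block, longest sequences first."""
    ordered = sorted(pairs, key=lambda p: -len(p[0]))
    body = "\n".join(fea_rule(seq, tgt) for seq, tgt in ordered)
    return f"feature {tag} {{\n{body}\n}} {tag};"

# Hypothetical sample data: ">a" and "<`w|" with their target characters.
SAMPLE = [(">a", "\u1f00"), ("<`w|", "\u1fa3")]
print(fea_feature("liga", SAMPLE))
```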

Thanks so much, and all best

Thomas

Thomas A. Schmitz