Greek in luatex


Thomas A. Schmitz

1 Sep 2007

12:56 p.m.

Hi all,

I've been experimenting with my Greek stuff in luatex, and I think I'm making nice progress. Things pretty much work with Unicode input, and as soon as the kerning problem is solved, I'm very optimistic. Two questions came up for me; I assume the answers are straightforward, but couldn't find anything:

1. How can I remap single characters? Let's say that we have a Unicode character in the input stream that maps to 0x03c3, but I want it remapped to 0x3f2, how can this be achieved?

2. Similarly: if I want to support the legacy input method babel, I need to remap the input stream to the Greek characters (question 1) and also need to feed the font some ligature rules, such as: the combination >a needs to be combined into the character 0x1f00. What would be the syntax and the way to do this?

All best

Thomas


Thomas A. Schmitz

11 Sep

8:47 a.m.

OK, the message below didn't get too many responses, so maybe I can rephrase my questions in a more precise manner:

1. For otftotfm, there's the "unicoding" command, where you can replace a character in a certain slot with another Unicode character, so you could say

    unicoding "A = uni03D1"

Is anything like this possible in luatex?

2. I see this code in font-otf.lua:

    fonts.otf.features.data.tex = {
        { "endash", "hyphen hyphen" },
        { "emdash", "hyphen hyphen hyphen" },
        { "quotedblleft", "quoteleft quoteleft" },
        { "quotedblright", "quoteright quoteright" },
        { "quotedblleft", "grave grave" },
        { "quotedblright", "quotesingle quotesingle" },
        { "quotedblbase", "comma comma" }
    }

and this list is used in the function

    function fonts.initializers.base.otf.texligatures(tfm,value)

How is it possible to write a similar list and function for just a single font or fonts in a specific typescript?

Thanks a lot!

Thomas

On Sep 1, 2007, at 12:56 PM, Thomas A. Schmitz wrote:

...

Hi all,

I've been experimenting with my Greek stuff in luatex, and I think I'm making nice progress. Things pretty much work with Unicode input, and as soon as the kerning problem is solved, I'm very optimistic. Two questions came up for me; I assume the answers are straightforward, but couldn't find anything:

1. How can I remap single characters? Let's say that we have a Unicode character in the input stream that maps to 0x03c3, but I want it remapped to 0x3f2, how can this be achieved?

2. Similarly: if I want to support the legacy input method babel, I need to remap the input stream to the Greek characters (question 1) and also need to feed the font some ligature rules, such as: the combination >a needs to be combined into the character 0x1f00. What would be the syntax and the way to do this?

All best

Thomas

Hans Hagen

12:12 p.m.

Thomas A. Schmitz wrote:

...

OK, the message below didn't get too many responses, so maybe I can rephrase my questions in a more precise manner:

1. For otftotfm, there's the "unicoding" command where you can replace a character in a certain slot with another unicode character, so you could say unicoding "A = uni03D1" Is anything like this possible in luatex?

sure, but it depends a bit on what level ... font driven or not

if you have open type fonts, you can add features on the fly ...

    \starttext

    \installfontfeature[otf][verb]

    \definefontfeature
      [test]
      [mode=node,language=dflt,script=latn,
       verb=yes,featurefile=verbose-digits.fea]

    {\font\test=name:lmroman10regular*test at 20pt \test 1 2 3 4}

    \ctxlua{characters.context.show(\number"00AB)}

    \stoptext

this replaces 1 by one and 2 by two ... the file verbose-digits.fea is in the distribution and is an example of a fontforge specification file

...

2. I see this code in font-otf.lua:

    fonts.otf.features.data.tex = {
        { "endash", "hyphen hyphen" },
        { "emdash", "hyphen hyphen hyphen" },
        { "quotedblleft", "quoteleft quoteleft" },
        { "quotedblright", "quoteright quoteright" },
        { "quotedblleft", "grave grave" },
        { "quotedblright", "quotesingle quotesingle" },
        { "quotedblbase", "comma comma" }
    }

that's ligatures and there for backward compatibility (hm, makes me wonder if it makes more sense to do that using feature files)

...

and this list is used in the function function fonts.initializers.base.otf.texligatures(tfm,value)

How is it possible to write a similar list and function for just a single font or fonts in a specific typescript?

in principle you can add lua code in typescripts and then register that as a feature (so, texligatures or tlig is one of them, as is lineheight)

it all depends on how generic things are; we can think of features like

    remap=name-of-remap-vector

(keep in mind that this operates on node lists then; re-encoding the input, i.e. regimes, is done differently)

so .. just write down detailed specs -)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Arthur Reutenauer

13 Sep

3:15 a.m.

Hello Thomas,

I was waiting for someone else to answer your questions because I had no clue how to address them even if I was interested; but now I do, thanks to Hans' reply:

For your general problem you need to define a new regime that will map each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of coding another set of characters and that it can forget the original stream. This treatment should be done before any sort of font property intervenes, because it does not depend on the appearance of the typeset text. That's what regimes are for.

Now I turn to Hans to give us guidelines on how to define an advanced regime in Mark IV: Hans, what we need here is to replace sequences of characters by other characters, so the mapping is not one-to-one and it's more complicated than simple regimes defined by a table lookup; but I guess all we have to do is write a lua function that we could plug into the input stream reading routine (just like other regimes work).

As far as the rest of Hans' reply is concerned (Opentype features and such), I would like to add that it is a very interesting and fascinating thing to do, but definitely not what you want here, for a lot of reasons: Opentype features can be used to alter the appearance of the text, but not the nature of the characters themselves. That is, if you did the transformation of your input stream at the font level, you would actually tell ConTeXt that you are handling Latin characters with a special appearance (that the font takes care of), so for example, the underlying text in a PDF would be a stream of Latin characters, and copying-and-pasting would yield Latin characters, not Greek. That is not what you want here: you want your "a" to be understood as "alpha" and your "less-than acute-sign w vertical-bar" to be considered an "omega with dasia, oxia and subscribed iota".

Nor should you think of these transformations as a collection of ligatures (which act at the font level), but rather as a text encoding, just like UTF-8 is an encoding of the Unicode characters: in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, and in the Babel input scheme for Ancient Greek the same character is encoded with the byte sequence "hexadecimal byte 3E [ASCII '>'], hexadecimal byte 61 [ASCII 'a']". Of course in the past, these transformations were handled at the font level and sequences like "< a" were actually ligatures, because that was all we had (and copy-pasting from a PDF was, mostly, doomed to fail); but we should not persist in that use now that we can treat them as real Unicode characters.

As for your other question in your original message from September 1st (remapping single characters, for example U+03C3 to U+03F2), I have to say first that I'm not very comfortable commenting on it since I'm not quite sure what the issues are here; it may be that you have a simple variant of some character, and this you should handle at font level (some glyph being transformed into some other one); but if I am to judge by the very example you gave, I would deem this should be a part of your input regime: indeed, if every sigma is to be mapped to lunate sigma, then it probably means that the lunate sigmas are part of your character stream (even if you didn't input them directly). But I really can't give any general advice here, especially because I don't actually know what a lunate sigma really is ;-) You would have to decide for yourself as a specialist of Greek if you're dealing with really different characters or simple font variants; in the former case you should handle the transformation as a part of your regime; in the latter, by defining a font feature like Hans demonstrated.

But for now, as long as it is understood that font tricks aren't the general solution for the problem at stake, I would like to demonstrate that it is still possible to do everything at font level :-)

If you have a look at the attached greek-babel.tex (and the features definition file greek-babel.fea) you will see that (almost) everything is taken care of using Opentype substitutions. You need Bosporos and GFS Baskerville to compile the file; by the way, the line with GFS Baskerville is a further proof that you shouldn't handle the transformation at font level: can you explain why it doesn't work here? As a complement, I also attach the Perl script which I wrote to generate the .fea file.

Arthur
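
The regime idea described in this message, mapping each Babel input sequence to one Unicode character before any font is involved, can be sketched outside TeX. The Python below is only an illustration under stated assumptions: the table `BABEL_TO_UNICODE` is a hypothetical, tiny excerpt of the Babel scheme, the function name `babel_to_unicode` is made up, and nothing here is ConTeXt or LuaTeX API.

```python
# Hypothetical, deliberately tiny excerpt of a Babel-style Greek mapping.
# Keys are ASCII input sequences, values are the Unicode characters they encode.
BABEL_TO_UNICODE = {
    "a": "\u03b1",   # alpha
    "h": "\u03b7",   # eta
    "w": "\u03c9",   # omega
    ">a": "\u1f00",  # alpha with psili
    "<a": "\u1f01",  # alpha with dasia
    "h|": "\u1fc3",  # eta with subscribed iota
    "w|": "\u1ff3",  # omega with subscribed iota
}

MAX_KEY_LENGTH = max(len(key) for key in BABEL_TO_UNICODE)


def babel_to_unicode(text):
    """Replace Babel sequences by Unicode characters, longest match first."""
    result = []
    position = 0
    while position < len(text):
        for length in range(MAX_KEY_LENGTH, 0, -1):
            chunk = text[position:position + length]
            if chunk in BABEL_TO_UNICODE:
                result.append(BABEL_TO_UNICODE[chunk])
                position += len(chunk)
                break
        else:
            # Not part of the mapping: pass the character through unchanged.
            result.append(text[position])
            position += 1
    return "".join(result)
```

Trying the longest sequence first is what keeps `h|` from being consumed as a plain `h` followed by a stray bar. Because the result is ordinary Unicode text, it does not depend on which glyphs a particular font happens to contain.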

Taco Hoekwater

9:03 a.m.

Arthur Reutenauer wrote:

...

Hello Thomas,

I was waiting for someone else to answer your questions because I had no clue how to address them even if I was interested; but now I do, thanks to Hans' reply:

For your general problem you need to define a new regime that will map each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of coding another set of characters and that it can forget the original stream. This treatment should be done before any sort of font property intervenes, because it does not depend on the appearance of the typeset text. That's what regimes are for.

Yes, except that we need a more powerful version (almost like OTPs) if we want to handle transcriptions properly. The vital point is that it should operate on tokens, not on nodes. I am not sure if Hans already has a hook there that can be extended.

...

If you have a look at the attached greek-babel.tex (and the features definition file greek-babel.fea) you will see that (almost) everything is taken care of using Opentype substitutions. You need Bosporos and GFS Baskerville to compile the file; by the way, the line with GFS Baskerville is a further proof that you shouldn't handle the transformation at font level: can you explain why it doesn't work here?

Possibly because a single one of the glyphs has a different name in GFS Baskerville, or because a previous gsub rule has e.g. replaced F;i; => Fi; (your own gsub rules are always executed last, after everything defined by the font). As you say, .fea's are definitely not the right way to handle this, even if they would work flawlessly.

Arthur Reutenauer

12:24 p.m.

...

Yes, except that we need a more powerful version (almost like OTPs) if we want to handle transcriptions properly. The vital point is that it should operate on tokens, not on nodes.

Yes, sure. OTP would work fine here, but I thought Mark IV had already something handy.

...

Possibly because a single one of the glyphs has a different name in GFS Baskerville, or because a previous gsub rule has e.g. replaced F;i; => Fi;

No, simply because GFS Baskerville has no glyphs for Latin characters, so they're dropped by the token reader and can't be transformed afterwards! Arthur
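
Arthur's explanation can be pictured with a toy model, written here in Python. It assumes, as a simplification, that characters without a glyph in the font are discarded before any font-level substitution runs. The function `render`, the substitution table and the two glyph sets are all hypothetical and do not correspond to anything in LuaTeX.

```python
def render(text, font_glyphs, substitutions):
    """Toy model: drop characters the font lacks, then substitute glyphs."""
    # Step 1: characters the font has no glyph for never make it through.
    kept = [char for char in text if char in font_glyphs]
    # Step 2: font-level substitution only sees what survived step 1.
    return "".join(substitutions.get(char, char) for char in kept)


# Hypothetical font-level rules turning Latin input into Greek glyphs.
SUBSTITUTIONS = {"a": "\u03b1", "b": "\u03b2"}

# A font that contains both Latin and Greek glyphs, and one that is Greek only.
FONT_WITH_LATIN = {"a", "b", "\u03b1", "\u03b2"}
GREEK_ONLY_FONT = {"\u03b1", "\u03b2"}

print(render("ab", FONT_WITH_LATIN, SUBSTITUTIONS))        # two Greek letters
print(repr(render("ab", GREEK_ONLY_FONT, SUBSTITUTIONS)))  # empty string
```

In this model the Greek-only font produces nothing at all, because the substitution rules have no input left to work on.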

Taco Hoekwater

1:38 p.m.

Arthur Reutenauer wrote:

...

...
Yes, except that we need a more powerful version (almost like OTPs) if we want to handle transcriptions properly. The vital point is that it should operate on tokens, not on nodes.

Yes, sure. OTP would work fine here, but I thought Mark IV had already something handy.

I played a bit, see attachment. Surely Hans will want to improve on this interface, so don't patch any of the core files just now. Best wishes, Taco

Thomas A. Schmitz

2:54 p.m.

On Sep 13, 2007, at 1:38 PM, Taco Hoekwater wrote:

...

Arthur Reutenauer wrote:

...
...
Yes, except that we need a more powerful version (almost like OTPs) if we want to handle transcriptions properly. The vital point is that it should operate on tokens, not on nodes. Yes, sure. OTP would work fine here, but I thought Mark IV had already something handy.

I played a bit, see attachment. Surely Hans will want to improve on this interface, so don't patch any of the core files just now.

Best wishes, Taco

Taco, it almost feels like today's my birthday - thanks again! Will look at it more closely soonish! Best Thomas

Arthur Reutenauer

8:36 p.m.

...

I played a bit, see attachment. Surely Hans will want to improve on this interface, so don't patch any of the core files just now.

Fantastic! Now I played a bit with your file myself, and compared with the behaviour of an OTP which has the same action: you can see that macro arguments between square brackets are preserved by OTP, whereas your function (obviously) converts everything unconditionally. How difficult would it be to program the same behaviour, that is, make collectors.handle pass to convert_babel only contiguous ranges of characters that are situated outside matching brackets?
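
The behaviour asked for here, converting only the runs of characters that lie outside matching square brackets, can be sketched on plain strings. This Python is an assumption-laden illustration, not the attached code: `convert_outside_brackets` and `to_greek` are hypothetical names, the two-letter mapping is invented, and real TeX input would additionally need control sequences such as `\blank` to be left alone, which a token-based filter gets from the tokenizer and this string-based sketch does not attempt.

```python
# Hypothetical stand-in for a converter such as convert_babel.
GREEK = str.maketrans({"a": "\u03b1", "b": "\u03b2"})


def to_greek(text):
    return text.translate(GREEK)


def convert_outside_brackets(text, convert):
    """Apply convert() only to contiguous runs outside [...] groups."""
    result = []
    pending = []  # characters outside brackets, waiting to be converted
    depth = 0     # current nesting level of square brackets
    for char in text:
        if char == "[":
            if depth == 0:
                # Flush the run collected so far before entering the group.
                result.append(convert("".join(pending)))
                pending = []
            depth += 1
            result.append(char)
        elif char == "]" and depth > 0:
            result.append(char)
            depth -= 1
        elif depth > 0:
            result.append(char)   # inside brackets: copy verbatim
        else:
            pending.append(char)  # outside brackets: convert later
    result.append(convert("".join(pending)))
    return "".join(result)
```

Passing whole runs to the converter, rather than single characters, matters because multi-character sequences must be seen together to be recognised.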

Hans Hagen

8:49 p.m.

Arthur Reutenauer wrote:

...

...
I played a bit, see attachment. Surely Hans will want to improve on this interface, so don't patch any of the core files just now.

Fantastic!

Now I played a bit with your file myself, and compared with the behaviour of an OTP which has the same action: you can see that macro arguments between square brackets are preserved by OTP, whereas your function (obviously) converts everything unconditionally. How difficult would it be to program the same behaviour, that is, make collectors.handle pass to convert_babel only contiguous ranges of characters that are situated outside matching brackets?

i'll wrap taco's macro up a bit

however, dealing with things like \blank[whatever] is not trivial

(1) we need to prevent expansion (register feature)
(2) but sometimes we need to expand
(3) and not all commands are treated the same

this is why otp like things are suboptimal

also, a proper toks handling mechanism should look at its neighbours

Hans

Hans Hagen

9:24 p.m.

Arthur Reutenauer wrote:

... greek ... greek ...

new beta

    \defineremapper[babelgreek]

    \remapcharacter[babelgreek][`a]{\alpha}
    \remapcharacter[babelgreek][`b]{\beta}
    \remapcharacter[babelgreek][`c]{\gamma}
    \remapcharacter[babelgreek][`d]{OEPS}

    \starttext

    [\startbabelgreek
    a b c some stuff here \blank[big]
    oeps b d
    \stopbabelgreek]

    [\babelgreek{some stuff here}]

    \stoptext

i can think of a more clever mechanism (have some ideas) but not now (in the middle of something else)

for arthur ... [] are skipped

for mojca ... this beta also fixes your accent problem (if she's in the mood for source browsing ... interesting solution)

for luigi ... working on a variant xml parser ... now loading 40 meg in 5 seconds

for taco ... i made your example into a configurable one

Hans

Arthur Reutenauer

9:45 p.m.

...

for arthur ... [] are skipped

Thanks! I guess there's more to it and token filtering is not the only way to do it, but it's still great. Arthur

Hans Hagen

10:20 p.m.

Arthur Reutenauer wrote:

...

...
for arthur ... [] are skipped

Thanks! I guess there's more to it and token filtering is not the only way to do it, but it's still great.

indeed, also, it's important to look fresh at these things and forget about how we do things now, else we replace hack with hack

Hans

Arthur Reutenauer

14 Sep

2:24 a.m.

...

indeed, also, it's important to look fresh at these things and forget about how we do things now, else we replace hack with hack

Sure, of course. I only thought this was a nice way of handling things but I'm not settled on that. Arthur

Thomas A. Schmitz

13 Sep

10:38 p.m.

On Sep 13, 2007, at 9:45 PM, Arthur Reutenauer wrote:

...

Thanks! I guess there's more to it and token filtering is not the only way to do it, but it's still great.

Arthur

Oh boy... I'm afraid I lost you there.

Hans, your remapper looks just like the thing I'd need for my Greek stuff. Right now, there appears to be a slight problem with the pdfs I produce with this code: on my system (OS X), they freeze or crash most pdf viewers (Adobe Reader can handle them; Preview, TeXShop and pdfview all crash or freeze).

Arthur, I also played with your fontfeatures. Most of the substitutions work, but there were a couple of problems that I just couldn't resolve, especially regarding the characters with an iota subscript: combinations involving accents and breathing (such as >~h|) were remapped correctly; the pure vowel + iota (h|) was not remapped. I guess I will wait till the dust settles a bit and you tell me which is the best way to pursue.

Taco, one question: Hans mentioned that support for "wide" postscript fonts via afm was not supported yet. Does that mean that type 1 fonts with a unicode encoding do not work yet? Thanks so much, all best Thomas

Hans Hagen

11:05 p.m.

Thomas A. Schmitz wrote:

...

Taco, one question: Hans mentioned that support for "wide" postscript fonts via afm was not supported yet. Does that mean that type 1 fonts with a unicode encoding do not work yet?

the latest mkiv works ok with wide fonts, the latest luatex also, but best wait till begin next week when all subsetting issues are resolved

Hans

Taco Hoekwater

11:52 p.m.

Hans Hagen wrote:

...

Thomas A. Schmitz wrote:

...
Taco, one question: Hans mentioned that support for "wide" postscript fonts via afm was not supported yet. Does that mean that type 1 fonts with a unicode encoding do not work yet?

the latest mkiv works ok with wide fonts, the latest luatex also, but best wait till begin next week when all subsetting issues are resolved

Like the man says. Best wishes, Taco PS It is amazing how Hans manages to answer questions to me before I even see them! All ntg-context mail arrives completely out of order and hours late, today.

Arthur Reutenauer

16 Sep

1:22 a.m.

...

there were a couple of problems that I just couldn't resolve, especially regarding the characters with an iota subscript:

Indeed. This is a problem with the Fontforge code applying the GSUB features: the 'grbl' feature is defined by two lookups, one being a list of single substitutions (h -> eta) and the other a list of ligature substitutions (h bar -> eta with subscribed iota). Now, since the latter has to take precedence to avoid conflicts, I explicitly put it before the other, but it seems that Fontforge ignores this and applies the list of single substitutions before the other (this is confirmed by the cache file BosporosU@greek-babel.tma where the lookup with the single substitutions, called "GreekBabelLookupSimple", appears first in the gsub table). Note that this doesn't happen for substitutions *inside* a lookup (so things like "greater eta bar" and "eta bar" don't conflict, since they're both ligature substitutions and I put the former first in the list, and the substitutions are correctly applied).

As far as I understand, this behaviour is actually compliant with the Opentype specifications and is quite widespread among typesetting engines and so it is not (only) Fontforge's fault; but, needless to say, it is nevertheless annoying. (In more crude terms: Opentype does not specify anything in that respect, so manufacturers of typesetting software can do whatever they want ...)

Thomas: to solve the problem at hand, you can try the new feature file I send along with a small test (I simply define a new feature that is to be applied after 'grbl', to deal specifically with the subscribed iotas).

Arthur

Taco Hoekwater

8:56 a.m.

Arthur Reutenauer wrote:

...

As far as I understand, this behaviour is actually compliant with the Opentype specifications and is quite widespread among typesetting engines and so it is not (only) Fontforge's fault; but, needless to say, it is nevertheless annoying. (In more crude terms: Opentype does not specify anything in that respect, so manufacturers of typesetting software can do whatever they want ...)

The specification says that lookups should be applied in LookupList order. Featurefiles don't have an explicit ordering command, but that does not mean that ordering should be irrelevant. So I think this is a bug in the version of fontforge I am using in luatex. I will do some testing. Best wishes, Taco

Taco Hoekwater

10:22 a.m.

Hi guys, Try this ordering:

...

lookup GreekBabelLookupMultiple { ... } GreekBabelLookupMultiple ;

lookup GreekBabelLookupSimple { ... } GreekBabelLookupSimple ;

Best wishes, Taco
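
Why the order of the two lookups matters can be shown with a small simulation. The Python below is a simplified model under stated assumptions, not Fontforge's or LuaTeX's implementation: each lookup is a list of (input sequence, replacement) rules, lookups are applied one after another over the whole glyph run, and the glyph names `eta` and `eta_iotasub` as well as the function names are hypothetical.

```python
def apply_lookup(glyphs, rules):
    """Apply one lookup: at each position, the first matching rule wins."""
    output = []
    position = 0
    while position < len(glyphs):
        for sequence, replacement in rules:
            end = position + len(sequence)
            if tuple(glyphs[position:end]) == sequence:
                output.append(replacement)
                position = end
                break
        else:
            # No rule matched here: keep the glyph as it is.
            output.append(glyphs[position])
            position += 1
    return output


def apply_lookups(glyphs, lookups):
    """Apply the lookups in list order, each on the previous result."""
    for rules in lookups:
        glyphs = apply_lookup(glyphs, rules)
    return glyphs


# Single substitutions (h -> eta) and ligature substitutions (h bar -> ...).
SIMPLE = [(("h",), "eta")]
MULTIPLE = [(("h", "bar"), "eta_iotasub")]

print(apply_lookups(["h", "bar"], [SIMPLE, MULTIPLE]))  # ['eta', 'bar']
print(apply_lookups(["h", "bar"], [MULTIPLE, SIMPLE]))  # ['eta_iotasub']
```

When the single substitutions run first, the `h` has already become `eta` by the time the ligature rule is tried, so the rule no longer matches. Putting the ligature lookup first lets the longer sequence win.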

Thomas A. Schmitz

3:01 p.m.

Hi Arthur, Taco,

you're my heroes! Changing the order of the lookup tables in the .fea file actually took care of the problem. Thanks for looking into this, now I get the results I was expecting; every substitution is applied to the font! Once the initial lookup has been done, this is reasonably fast, too, so I like it. I'm eagerly waiting for the new release next week to see if this works with copy-and-paste from pdfs. So this appears to be one way to deal with ASCII input a la babel. Easy to implement, but fails on fonts that don't have the glyphs for the Latin characters.

One trivial question: when I want to experiment with feature files, the cached instance of the font seems to be in the way. Only after deleting the current luatex-cache, regenerating it and recompiling the format do I get proper results. Is there an easier/faster way to do this?

Will now go on and experiment some more, especially with type1/afm-based fonts.

Thanks a lot, best wishes

Thomas

On Sep 16, 2007, at 10:22 AM, Taco Hoekwater wrote:

...

Hi guys,

Try this ordering:

...
lookup GreekBabelLookupMultiple { ... } GreekBabelLookupMultiple ;

lookup GreekBabelLookupSimple { ... } GreekBabelLookupSimple ;

Hans Hagen

17 Sep

1:12 a.m.

Thomas A. Schmitz wrote:

...

Hi Arthur, Taco,

you're my heroes! Changing the order of the lookup tables in the .fea file actually took care of the problem. Thanks for looking into this, now I get the results I was expecting; every substitution is applied to the font! Once the initial lookup has been done, this is reasonably fast, too, so I like it. I'm eagerly waiting for the new release next week to see if this works with copy-and-paste from pdfs. So this appears to be one way to deal with ASCII input a la babel. Easy to implement, but fails on fonts that don't have the glyphs for the Latin characters.

arthur mentions the final sigma in the fea file .. can be a (part of) feature too (like fina)

...

One trivial question: when I want to experiment with feature files, the cached instance of the font seems to be in the way. Only after deleting the current luatex-cache, regenerating it and recompiling the format do I get proper results. Is there an easier/faster way to do this?

bumping the version number of the otf handler will force this, but this is a bad idea; also, caching is fast because no file checking has to be done, so deleting cached files (just the one you test) is the price you pay when developing a font (fea) file

btw, the fea file can be part of the distribution (but we need to think of a naming scheme)

Hans

Arthur Reutenauer

16 Sep

3:08 p.m.

...

Try this ordering:

Yes, it works. So Fontforge is sensitive to the order in which the lookups are defined in the file? Interesting ... Thomas, you can try this but I have made a mistake in the Unicode code for omega with subscribed iota: it should be 1FF3 and not 1FD3. Arthur

Thomas A. Schmitz

3:44 p.m.

On Sep 16, 2007, at 3:08 PM, Arthur Reutenauer wrote:

...

Yes, it works. So Fontforge is sensitive to the order in which the lookups are defined in the file? Interesting ...

Thomas, you can try this but I have made a mistake in the Unicode code for omega with subscribed iota: it should be 1FF3 and not 1FD3.

Arthur

Yep, I had already fixed that (and also replied to Taco's message, the context list is again a bit out of order today). Arthur, while we're at it: could you try and insert this line into the fea-file:

    sub quotedbl quotesingle i by un1FD3 ;

whenever I try anything like this with the quotedbl character (which produces some ligatures), I get this error:

    Fatal error occurred, no output PDF file produced!

(Or similar errors with other fonts). The mechanism for the single dieresis works:

    sub quotedbl i by uni03CA ;

but nothing with quotedbl + something else. Do you have any idea what triggers this error?

Best

Thomas

Hans Hagen

17 Sep

10:48 a.m.

Hi Arthur and Thomas,

i've put the greek file in the distribution (fea path), do we also need this babel stuff for "u and such? we should start thinking about a set of predefined features

Hans

Hans Hagen

13 Sep

7:42 p.m.

Taco Hoekwater wrote:

...

Yes, except that we need a more powerful version (almost like OTPs) if we want to handle transcriptions properly. The vital point is that it should operate on tokens, not on nodes. I am not sure if Hans already has a hook there that can be extended.

there are hooks, but i want to avoid token processing as much as possible because it's slow (so it can definitely not be done on all the data, as with nodes); i must give it some thought ..

Hans

Thomas A. Schmitz

11:45 a.m.

Hi Arthur, first of all: thank you so much for your time and your expertise! Your reply and your scripts really make things a lot clearer for me; this is a huge step forward! I'll have to experiment and think more about it, here's just a few reactions to some of your remarks: On Sep 13, 2007, at 3:15 AM, Arthur Reutenauer wrote:

...

Hello Thomas,

I was waiting for someone else to answer your questions because I had no clue how to address them even if I was interested; but now I do, thanks to Hans' reply:

For your general problem you need to define a new regime that will map each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of coding another set of characters and that it can forget the original stream. This treatment should be done before any sort of font property intervenes, because it does not depend on the appearance of the typeset text. That's what regimes are for.

I agree that this would probably be the cleanest solution: since luatex has unicode support, map everything to the corresponding Unicode characters. This would also make hyphenation easier to achieve.

...

Now I turn to Hans to give us guidelines on how to define an advanced regime in Mark IV: Hans, what we need here is to replace sequences of characters by other characters, so the mapping is not one-to-one and it's more complicated than simple regimes defined by a table lookup; but I guess all we have to do is write a lua function that we could plug into the input stream reading routine (just like other regimes work).

As far as the rest of Hans' reply is concerned (Opentype features and such), I would like to add that it is a very interesting and fascinating thing to do, but definitely not what you want here, for a lot of reasons: Opentype features can be used to alter the appearance of the text, but not the nature of the characters themselves. That is, if you did the transformation of your input stream at the font level, you would actually tell ConTeXt that you are handling Latin characters with a special appearance (that the font takes care of), so for example, the underlying text in a PDF would be a stream of Latin characters, and copying-and-pasting would yield Latin characters, not Greek.

The question of copy-and-paste is one of the big mysteries, and I have no clue why it works in some cases, but not in others. Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly no matter if I use babel or Unicode input. Never touch a running system: I just take this as some sort of divine favor and leave it at that...

...

That is not what you want here: you want your "a" to be understood as "alpha" and your "less-than acute-sign w vertical-bar" to be considered an "omega with dasia, oxia and subscribed iota". Nor should you think of these transformations as a collection of ligatures (which act at the font level), but rather as a text encoding, just like UTF-8 is an encoding of the Unicode characters: in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, and in the Babel input scheme for Ancient Greek the same character is encoded with the byte sequence "hexadecimal byte 3E [ASCII '>'], hexadecimal byte 61 [ASCII 'a']".

Yes, that's crystal clear. It would also take care of another problem: in the input stream, you know exactly which character sequence translates to what. On the font level, legacy fonts sometimes have their own ideas about where to put certain glyphs.
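
The encoding analogy can be checked directly. The short Python sketch below prints the UTF-8 bytes of U+1F00 next to the ASCII bytes of the Babel sequence `>a` from the original question; the helper name `hex_bytes` is made up for this illustration.

```python
import unicodedata


def hex_bytes(data):
    """Format a bytes object as space-separated lowercase hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data)


alpha_psili = "\u1f00"

print(unicodedata.name(alpha_psili))           # GREEK SMALL LETTER ALPHA WITH PSILI
print(hex_bytes(alpha_psili.encode("utf-8")))  # e1 bc 80
print(hex_bytes(">a".encode("ascii")))         # 3e 61
```

Both byte sequences are just different ways of writing down the same character, which is why the conversion belongs at the input stage and not in the font.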

...

Of course in the past, these transformations were handled at the font level and sequences like "< a" were actually ligatures, because that was all we had (and copypasting from a PDF was, mostly, doomed to fail); but we should not persist in that use now we can treat them as real Unicode characters.

Well yes, but see above.

...

As for your other question in your original message from September 1st (remapping single characters, for example U+03C3 to U+03F2), I have to say first that I'm not very comfortable commenting on it since I'm not quite sure what the issues are here; it may be that you have a simple variant of some character, and this you should handle at font level (some glyph being transformed into some other one); but if I am to judge by the very example you gave, I would deem this should be a part of your input regime: indeed, if every sigma is to be mapped to lunate sigma, then it probably means that the lunate sigmas are part of your character stream (even if you didn't input it directly). But I really can't give any general advice here, especially because I don't actually know what a lunate sigma really is ;-) You would have to decide for yourself as a specialist of Greek if you're dealing with really different characters or simple font variants; in the former case you should handle the transformation as a part of your regime; in the latter, by defining a font feature like Hans demonstrated.

I guess that different sorts of users would respond differently. In Unicode, there's a different slot for some alternate characters, so the Unicode standard really considers them different characters. For the classicist, a sigma is a sigma, and the fact that it can be rendered as a "lunate" or a "normal" sigma is irrelevant. For me, this makes more sense, so I would support this on the font level.
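
For comparison, the input-regime alternative that Arthur describes amounts to a plain code point substitution. The following Python sketch is only an illustration of that idea, not ConTeXt code; the names LUNATE_SIGMA and remap_sigma are hypothetical, and remapping the final sigma as well is an assumption that goes beyond the single U+03C3 example in this thread.

```python
# Hypothetical remapping table: code point in the input -> code point wanted.
LUNATE_SIGMA = {
    0x03C3: 0x03F2,  # GREEK SMALL LETTER SIGMA -> GREEK LUNATE SIGMA SYMBOL
    0x03C2: 0x03F2,  # assumption: final sigma is remapped as well
}


def remap_sigma(text: str) -> str:
    """Replace every sigma in the character stream with a lunate sigma."""
    return text.translate(LUNATE_SIGMA)


word = "\u03c3\u03bf\u03c6\u03cc\u03c2"  # sigma, omicron, phi, omicron with tonos, final sigma
assert remap_sigma(word) == "\u03f2\u03bf\u03c6\u03cc\u03f2"
```

Done this way, the lunate sigmas really are part of the character stream; done at the font level, the stream keeps its ordinary sigmas and only the glyph changes.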

...

But for now, as long as it is understood that font tricks aren't the general solution for the problem at stake, I would like to demonstrate that it is still possible to do everything at font level :-)

If you have a look at the attached greek-babel.tex (and the features definition file greek-babel.fea) you will see that (almost) everything is taken care of using OpenType substitutions. You need Bosporos and GFS Baskerville to compile the file; by the way, the line with GFS Baskerville is a further proof that you shouldn't handle the transformation at font level: can you explain why it doesn't work here? As a complement, I also attach the Perl script which I wrote to generate the .fea file.

Wonderful! I will look carefully at these files. I've been playing with perl and python all day yesterday for another problem, so I'm very much looking forward to studying your script. Thanks so much, and all best Thomas

Arthur Reutenauer

12:49 p.m.

...

Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly no matter if I use babel or Unicode input.

You mean with LuaTeX? Copypasting isn't supported yet in LuaTeX so it's no surprise that it wouldn't work (for me Adobe Reader and Preview fail in two different ways). As for pdfTeX I leave that to Taco and others to answer. But hyphenation is another important issue, maybe even clearer.

...

I guess that different sorts of users would respond differently. In Unicode, there's a different slot for some alternate characters, so the Unicode standard really considers them different characters.

Actually, now I think about it, the name for U+03F2 has "symbol" in it, and that's a clear indication that the character is intended for "technical use", not for inputting Greek text; so your choice is consistent with the intent of the Standard.

...

Wonderful! I will look carefully at these files. I've been playing with perl and python all day yesterday for another problem, so I'm very much looking forward to studying your script.

Somewhere in the middle of writing it, I realized that I should have written it in Lua :-) It wouldn't have been much different. Arthur

Thomas A. Schmitz

2:51 p.m.

On Sep 13, 2007, at 12:49 PM, Arthur Reutenauer wrote:

...

You mean with LuaTeX? Copypasting isn't supported yet in LuaTeX so it's no surprise that it wouldn't work (for me Adobe Reader and Preview fail in two different ways). As for pdfTeX I leave that to Taco and others to answer.

But hyphenation is another important issue, maybe even clearer.

Yes, I meant in pdfTeX, sorry for being imprecise.

...

Actually, now I think about it, the name for U+03F2 has "symbol" in it, and that's a clear indication that the character is intended for "technical use", not for inputting Greek text; so your choice is consistent with the intent of the Standard.

OK, good to hear that. I now realize that much of the stuff that I hacked together for use with pdfTeX worked by dumb luck; with LuaTeX, I'll be forced to adhere to standards more closely. I guess that's a good thing...

...

Somewhere in the middle of writing it, I realized that I should have written it in Lua :-) It wouldn't have been much different.

Yes, I'm hoping to look into lua as well. Thanks so much! Thomas

Taco Hoekwater

4:25 p.m.

Arthur Reutenauer wrote:

...

...
Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly no matter if I use babel or Unicode input.

You mean with LuaTeX? Copypasting isn't supported yet in LuaTeX so it's no surprise that it wouldn't work (for me Adobe Reader and Preview fail in two different ways). As for pdfTeX I leave that to Taco and others to answer.

The next luatex release will finally have support for cut&paste when using opentype and truetype fonts. In pdftex, cut&paste for traditional type1 fonts was already present, and that will continue to work as it did (at least for the immediate future). Best wishes, Taco

Hans Hagen

7:51 p.m.

Thomas A. Schmitz wrote:

...

...
For your general problem you need to define a new regime that will map each relevant character sequence to the corresponding Unicode character. That is, you inform ConTeXt that the character stream it sees is actually a way of coding another set of characters and that it can forget the original stream. This treatment should be done before any sort of font property intervenes, because it does not depend on the appearance of the typeset text. That's what regimes are for.

regimes are a solution, but what solution is best depends on the input stream ... whole document? partial document? also written to external files? eventually everything can become unicode (private areas) and as such travel through the system; or we can misuse virtual fonts ...

...

...
we could plug into the input stream reading routine (just like other regimes work).

there are mechanisms for that (because that's what i played a lot with last year); there was (maybe even is) a mechanism for chained processing of input etc

...

...
actually tell ConTeXt that you are handling Latin characters with a special appearance (that the font takes care of), so for example, the underlying text in a PDF would be a stream of Latin characters, and copying-and-pasting would yield Latin characters, not Greek.

not entirely true ... we can (and do) intercept the node stream ... ok, at that point we're dealing with a font/char pair, but we can change the char (or node) to whatever we like ... depends on the problem

...

The question of copy-and-paste is one of the big mysteries, and I have no clue why it works in some cases, but not in others. Right now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste correctly, and it does it correctly no matter if I use babel or Unicode input. Never touch a running system: I just take this as some sort of divine favor and leave it at that...

that's a matter of associating tounicode points; of course, no unicode means no copy/paste :-)

...

...
That is not what you want here: you want your "a" to be understood as "alpha" and your "less-than grave-accent w vertical-bar" to be considered an "omega with dasia, varia and subscribed iota". Nor should you think of these transformations as a collection of ligatures (which act at the font level), but rather as a text encoding, just like UTF-8 is an encoding of the Unicode characters: in UTF-8 the byte sequence "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, and in the Babel input scheme for Ancient Greek the same character is encoded with the byte sequence "hexadecimal byte 3E [ASCII '>'], hexadecimal byte 61 [ASCII 'a']".

Yes, that's crystal clear. It would also take care of another problem: in the input stream, you know exactly which character sequence translates to what. On the font level, legacy fonts sometimes have their own ideas about where to put certain glyphs.

depends ... the input char becomes a node; now, if (probably controlled by attributes) a certain char is seen (say 'a') and you want it to be an alpha, well, we can change that char then in the node
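
As a rough picture of the node-level approach, the following Python toy model treats the node stream as a list of font/char pairs and swaps the char only on nodes that carry a marker. It is not the LuaTeX node API: the Glyph class, its greek flag (standing in for an attribute) and the LATIN_TO_GREEK table are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Glyph:
    """Toy stand-in for a glyph node: a font/char pair plus one flag."""
    font: int            # font identifier
    char: int            # code point the node currently carries
    greek: bool = False  # stand-in for an attribute set on the node


# Hypothetical replacement table, keyed by the Latin code point.
LATIN_TO_GREEK = {
    ord("a"): 0x03B1,  # GREEK SMALL LETTER ALPHA
    ord("w"): 0x03C9,  # GREEK SMALL LETTER OMEGA
}


def remap_nodes(nodes):
    """Return a new node list with the char swapped on flagged nodes only."""
    result = []
    for n in nodes:
        if n.greek and n.char in LATIN_TO_GREEK:
            n = replace(n, char=LATIN_TO_GREEK[n.char])
        result.append(n)
    return result


stream = [Glyph(1, ord("a"), greek=True), Glyph(1, ord("a"))]
remapped = remap_nodes(stream)
assert remapped[0].char == 0x03B1    # flagged node now carries alpha
assert remapped[1].char == ord("a")  # unflagged node is left alone
```

The point of the flag is that the same input char can be left alone in one stretch of text and replaced in another, which a global input regime cannot do by itself.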

...

...
Of course in the past, these transformations were handled at the font level and sequences like "< a" were actually ligatures, because that was all we had (and copypasting from a PDF was, mostly, doomed to fail); but we should not persist in that use now we can treat them as real Unicode characters.

those hard coded mechanisms were indeed not sufficient

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
