Early shots often go wrong. I take that back: capturing one multibyte character actually works if you know its UTF-8 length! I just have to write the parsers for the tables now.

Philipp

On 2010-03-08 <07:55:06>, Philipp Gesang wrote:
On 2010-03-07 <20:07:16>, Thomas A. Schmitz wrote:
On Mar 7, 2010, at 11:54 AM, Philipp Gesang wrote:
Just one thought on your transliterator: a couple of years ago, Hans set up something a bit similar for Greek. It is based on lpeg, though, not gsub and so should be somewhat faster. If you look at context/tex/texmf-context/scripts/context/lua/mtx-babel.lua you'll see what he did. In theory, this mechanism is general, and all sorts of transliteration schemes could be hooked into it. Might give you some ideas or not...
I'm afraid lpegs, elegant though they are, would complicate the matter a bit. Try this:
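\startluacode
s1, s2, s3 = "abc", "äbz", "аbc"
p1, p2, p3 = lpeg.P("a"), lpeg.P("ä"), lpeg.P("а")
-- the "а" in s3 and p3 is Cyrillic, == u1072
-- lpeg.match returns the position after the match, counted in bytes
context(lpeg.match(p1, s1)) --> 2, correct
context(lpeg.match(p2, s2)) --> 3, wrong (one character, but two bytes)
context(lpeg.match(p3, s3)) --> 3, wrong (one character, but two bytes)
\stopluacode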
You'll see that lpeg isn't Unicode-aware: it matches and counts bytes, so the position it returns after a two-byte character like "ä" is 3, not 2. On the other hand, Roberto has a snippet on his page[1] that gets the Unicode number out of a UTF-8 octet sequence (up to 4 bytes), though I'm in no hurry to go this way: it would mean converting all the tables into integers, converting the input into an array of ints, then doing multi-char replacement (= integer substitution) on this array, and finally converting it back into a sequence of chars. I'm not sure the transliteration of a few single words is worth it.
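For illustration, the round trip described above (chars to integers, substitution, integers back to chars) could look roughly like the following in plain Lua, without lpeg. This is only a sketch under assumptions: the helper names (utf8_length, to_codepoints, from_codepoint, transliterate) and the tiny table translit are made up for this example and are not taken from Roberto's page or from mtx-babel.lua, and it only handles single code points on the input side, not multi-char sequences.

```lua
-- Number of bytes in the UTF-8 sequence that starts with lead byte b.
local function utf8_length(b)
  if b < 0x80 then return 1                     -- ASCII
  elseif b >= 0xC2 and b < 0xE0 then return 2
  elseif b >= 0xE0 and b < 0xF0 then return 3
  elseif b >= 0xF0 and b < 0xF5 then return 4
  end
  error("invalid UTF-8 lead byte: " .. b)
end

-- Decode a UTF-8 string into an array of code points (integers).
local function to_codepoints(s)
  local cps, i = {}, 1
  while i <= #s do
    local lead = s:byte(i)
    local len = utf8_length(lead)
    -- Strip the length marker from the lead byte, keep its payload bits.
    local cp = lead
    if len == 2 then cp = lead - 0xC0
    elseif len == 3 then cp = lead - 0xE0
    elseif len == 4 then cp = lead - 0xF0
    end
    -- Each continuation byte contributes six more bits.
    for j = i + 1, i + len - 1 do
      local b = s:byte(j)
      if not b or b < 0x80 or b > 0xBF then
        error("invalid UTF-8 continuation byte at position " .. j)
      end
      cp = cp * 64 + (b - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return cps
end

-- Encode one code point back into its UTF-8 byte sequence.
local function from_codepoint(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 64), 0x80 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 4096),
                       0x80 + math.floor(cp / 64) % 64,
                       0x80 + cp % 64)
  else
    return string.char(0xF0 + math.floor(cp / 262144),
                       0x80 + math.floor(cp / 4096) % 64,
                       0x80 + math.floor(cp / 64) % 64,
                       0x80 + cp % 64)
  end
end

-- Hypothetical transliteration table, keyed by code point.
local translit = {
  [0x0430] = "a",   -- Cyrillic а (decimal 1072)
  [0x0431] = "b",   -- Cyrillic б
  [0x0436] = "zh",  -- Cyrillic ж
}

-- Replace every code point found in map; copy all others unchanged.
local function transliterate(s, map)
  local out = {}
  for _, cp in ipairs(to_codepoints(s)) do
    out[#out + 1] = map[cp] or from_codepoint(cp)
  end
  return table.concat(out)
end

print(transliterate("жаба", translit)) --> zhaba
```

The arithmetic (division and modulo instead of bit operations) is chosen so the sketch runs on Lua 5.1 as well as on later versions. Lua 5.3 and newer ship a utf8 library whose utf8.codepoint and utf8.char cover the decoding and encoding steps.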
Anyway, I'm glad you pointed me to the babel script as I hadn't noticed it before.
Philipp
[1] http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html#ex
Thomas

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
-- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments