Dear listmates,
a while ago I made some Lua tables to help me transliterate various scripts. I thought this could be of general interest, so I built a module from them and threw in some source documentation and a tiny manual, too.
Right now it contains modes for the Cyrillic, Glagolitic and Greek scripts, older variants included, full ISO 9 support, and two transcription modes as well. (There's a showcase of examples in the manual.) It allows global setups and local adjustments.
It works splendidly for me, but maybe somebody wants to review it before I post it on the list, so whom should I send it to?
Philipp
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
On 07.03.10 11:54, Philipp Gesang wrote:
It works splendidly for me, but maybe somebody wants to review it before I post it on the list, so whom should I send it to?
There is nothing wrong with sending it to the list or providing the files online; that way more people can comment on your module.
Wolfgang
On 2010-03-07 <12:04:36>, Wolfgang Schuster wrote:
There is nothing wrong with sending it to the list or providing the files online; that way more people can comment on your module.
Here you are. I included a PDF of the manual, as not everybody will have the required fonts to build it.
Philipp
On 07.03.10 12:46, Philipp Gesang wrote:
Here you are. I included a PDF of the manual, as not everybody will have the required fonts to build it.
No files!
Wolfgang
On 2010-03-07 <12:59:55>, Wolfgang Schuster wrote:
No files!
Date: Sun, 07 Mar 2010 12:48:13 +0100
From: ntg-context-bounces@ntg.nl
To: pgesang@ix.urz.uni-heidelberg.de
Subject: Your message to ntg-context awaits moderator approval
We'll have to wait a bit.
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
Philipp Gesang wrote:
We'll have to wait a bit.
A bit too large for the mailing list. I extracted and rezipped the attachment, then uploaded it to
http://wiki.contextgarden.net/images/4/42/Transliterator.zip
Best wishes, Taco
On 2010-03-07 <14:09:07>, Taco Hoekwater wrote:
A bit too large for the mailing list. I extracted and rezipped the attachment, then uploaded it to
http://wiki.contextgarden.net/images/4/42/Transliterator.zip
Sorry, my fault, I sent the repository as well …
Philipp
Philipp Gesang wrote:
Sorry, my fault, I sent the repository as well …
No problem.
I don't know any Cyrillic, but I had a quick look and have a few remarks for you to consider (I weeded out the comments already made by Wolfgang):
* \loadmarkfile takes braces, not brackets, for its argument, so you should use \loadmarkfile{t-transliterator}.
* You should create t-transliterator.mkii, even (especially) when it does nothing except give a \message{Module is unsupported under mkii} followed by \endinput.
* Besides splitting into mkiv and lua, you may even want to create a separate file for each transliteration, which is then loaded by a filename key. This would make it easier for other people to add transliterations you did not implement yourself.
* I expect the substitutions themselves can be sped up a little with no effort by using the normal string.gsub. That function is 8-bit clean, so there is no need for the added complication of UTF-8 processing.
* It is common in ConTeXt to also provide a \starttransliteration ... \stoptransliteration pair (but there is no requirement).
* Finally, where should I get the CMU fonts from?
Best wishes, Taco
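A minimal sketch of the byte-level substitution suggested in the fourth remark, written in plain Lua. The three-entry table and the function names are made up for illustration and are not taken from the module; the point is only that string.gsub, although it works on bytes, replaces multi-byte UTF-8 sequences correctly as long as the keys are matched literally.

```lua
-- Hypothetical mini table (Cyrillic -> Latin); the module's real tables
-- are larger and are not reproduced here.
local map = {
  ["ж"] = "ž",
  ["ч"] = "č",
  ["а"] = "a",
}

-- Escape Lua pattern magic characters so that a key is matched literally.
local function escape(s)
  return (s:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

-- Apply every substitution with the ordinary, byte-oriented string.gsub.
-- Assumption: no replacement value contains "%", which gsub would treat
-- specially in the replacement string.
local function transliterate(text)
  for from, to in pairs(map) do
    text = string.gsub(text, escape(from), to)
  end
  return text
end

print(#"ж")                  --> 2 (two bytes in UTF-8)
print(transliterate("жача")) --> žača
```

This relies on the source file being saved as UTF-8. Because no valid UTF-8 sequence can occur inside another character's bytes, the byte-wise match cannot hit the middle of a character.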
On 07.03.10 14:56, Taco Hoekwater wrote:
* Finally, where should I get the CMU fonts from?
http://canopus.iacp.dvo.ru/~panov/cm-unicode/
Wolfgang
Hi Taco,
On 2010-03-07 <14:56:52>, Taco Hoekwater wrote:
I don't know any Cyrillic, but I had a quick look and have a few remarks for you to consider (I weeded out the comments already made by Wolfgang):
* \loadmarkfile takes braces, not brackets, for its argument, so you should use \loadmarkfile{t-transliterator}.
Done!
* You should create t-transliterator.mkii, even (especially) when it does nothing except give a \message{Module is unsupported under mkii} followed by \endinput.
Done!
* Besides splitting into mkiv and lua, you may even want to create a separate file for each transliteration, which is then loaded by a filename key. This would make it easier for other people to add transliterations you did not implement yourself.
As I already wrote in my reply to Wolfgang: eventually I'll do this, but I'm not yet sure exactly what it will look like.
* I expect the substitutions themselves can be sped up a little with no effort by using the normal string.gsub. That function is 8-bit clean, so there is no need for the added complication of UTF-8 processing.
You're right again: I measured a 33% shorter execution time in an extreme case with string.gsub. Done!
* It is common in ConTeXt to also provide a \starttransliteration ... \stoptransliteration pair (but there is no requirement).
I noticed that, and of course I would like to provide an environment, but so far I have had no luck implementing it. I'm stuck at the following:
\def\starttransliterate {%
    \bgroup\dostarttransliterate%
}
\def\stoptransliterate {%
    \egroup
    \transliterate{%
        \getbuffer[trl]%
    }%
}
\def\dostarttransliterate{%
    \dostartbuffer[trl][starttransliterate][stoptransliterate]%
}
and the buffer content isn't expanded before it is passed to Lua; adding \@EA in various places has no effect either.
* Finally, where should I get the CMU fonts from?
Wolfgang already posted the link, but CMU is buggy in some cases, so to compile the manual you'll need this font, too: http://kodeks.uni-bamberg.de/AKSL/Schrift/BukyVede.htm
I'll post a new revision of the module during the week.
Many thanks to you, too!
Philipp
On 07.03.10 17:59, Philipp Gesang wrote:
* It is common in ConTeXt to also provide a \starttransliteration ... \stoptransliteration pair (but there is no requirement).
I noticed that, and of course I would like to provide an environment, but so far I have had no luck implementing it. [...] the buffer content isn't expanded before it is passed to Lua; adding \@EA in various places has no effect either.
\def\starttransliterate
  {\bgroup
   \dosingleempty\dostarttransliterate}

\long\def\dostarttransliterate[#1]#2\stoptransliterate
  {\iffirstargument
     \setuptransliterate[#1]%
   \fi
   \translate[\TRLhyphenate]%
   \ctxlua{translit.transliterate("\TRLmode","\luaescapestring{#2}")}%
   \egroup}
Wolfgang
On 2010-03-07 <18:12:56>, Wolfgang Schuster wrote:
\long\def\dostarttransliterate[#1]#2\stoptransliterate
  {\iffirstargument
     \setuptransliterate[#1]%
   \fi
   \translate[\TRLhyphenate]%
“\translate” -- is this a typo? I need to set up hyphenation.
On 2010-03-07 <13:02:40>, Mojca Miklavec wrote:
On Sun, Mar 7, 2010 at 11:54, Philipp Gesang wrote:
Right now it contains modes for Cyrillic, Glagolitic and Greek scripts, older variants included, full ISO 9 support
Doesn't ISO 9 (ISO-8859-9) support already work?
Normally I don't post Wikipedia links, but as you'd have to pay for the standard, here's an exception: http://en.wikipedia.org/wiki/ISO_9
No encoding issue is involved here ...
Philipp
Hi Philipp,
a few comments on your code (TeX only):
You don't need an MKIV file; you can keep everything in the TeX file. What you can also do is move the Lua code into a separate file.
It's nice to see that you provide an (XML) interface file.
Wouldn't \setuptransliterate be a better name for the setup command?
A better definition for the setup command is:
\def\setupTranslit{\dodoubleargument\getparameters[TRL]}
You can also shorten your \dotransliterate macro a lot:
\def\dotransliterate[#1]#2%
  {\bgroup
   \iffirstargument
     \getparameters[TRL][#1]%
   \fi
   \language[\TRLhyphenate]%
   \ctxlua{translit.transliterate("\TRLmode","\luaescapestring{#2}")}%
   \egroup}
You have a wrong entry in the module's metadata:
%D [ file=t-degrade,
Wolfgang
On 2010-03-07 <14:40:37>, Wolfgang Schuster wrote:
Hi Philipp,
a few comments on your code (TeX only):
That's really nice.
You don't need an MKIV file; you can keep everything in the TeX file. What you can also do is move the Lua code into a separate file.
I already considered this, but I have to make up my mind about it first.
Wouldn't \setuptransliterate be a better name for the setup command?
That's true; done!
A better definition for the setup command is:
\def\setupTranslit{\dodoubleargument\getparameters[TRL]}
Done!
You can also shorten your \dotransliterate macro a lot:
\def\dotransliterate[#1]#2%
  {\bgroup
   \iffirstargument
     \getparameters[TRL][#1]%
   \fi
   \language[\TRLhyphenate]%
   \ctxlua{translit.transliterate("\TRLmode","\luaescapestring{#2}")}%
   \egroup}
Done. It seems I was a little too worried about overwriting the globals; I promise never, ever to mistrust TeX's grouping again.
You have a wrong entry in the module's metadata:
%D [ file=t-degrade,
Done! As is plain to see, I just copied and pasted those parts initially. Is there a clean template somewhere around?
Thanks a lot for your improvements!
Philipp
On 07.03.10 17:43, Philipp Gesang wrote:
Is there a clean template somewhere around?
No, I also always use one from my older files. What you should set are 'title', 'subtitle' and 'author', because they are used when you generate a formatted source.
<template>
%D \module
%D   [       file=<filename>,        % e.g. t-xxx
%D        version=<module version>,  % e.g. 20xx.xx.xx
%D          title=<module title>,    % e.g. \CONTEXT\ User Module
%D       subtitle=<module subtitle>,
%D         author=<author>,
%D           date=<date>,            % e.g. \currentdate
%D      copyright=<copyright>,       % e.g. <author>
%D        license=<license>]         % e.g. 'Public Domain' or 'GNU GPL 2.0'

\unprotect

% module code

\protect \endinput
</template>
Wolfgang
Wikified!
On 2010-03-07 <18:42:33>, Wolfgang Schuster wrote:
No, I also always use one from my older files. What you should set are 'title', 'subtitle' and 'author', because they are used when you generate a formatted source.
On Mar 7, 2010, at 11:54 AM, Philipp Gesang wrote:
a while ago I made some Lua tables to help me transliterate various scripts. I thought this could be of general interest, so I built a module from them and threw in some source documentation and a tiny manual, too.
Just one thought on your transliterator: a couple of years ago, Hans set up something a bit similar for Greek. It is based on lpeg, though, not gsub, and so should be somewhat faster. If you look at context/tex/texmf-context/scripts/context/lua/mtx-babel.lua you'll see what he did. In theory, this mechanism is general, and all sorts of transliteration schemes could be hooked into it. Might give you some ideas or not...
Thomas
On 2010-03-07 <20:07:16>, Thomas A. Schmitz wrote:
Just one thought on your transliterator: a couple of years ago, Hans set up something a bit similar for Greek. It is based on lpeg, though, not gsub, and so should be somewhat faster. If you look at context/tex/texmf-context/scripts/context/lua/mtx-babel.lua you'll see what he did. In theory, this mechanism is general, and all sorts of transliteration schemes could be hooked into it. Might give you some ideas or not...
I'm afraid lpegs, elegant though they are, would complicate the matter a bit. Try this:
\startluacode
s1, s2, s3 = "abc", "äbz", "аbc"
p1, p2, p3 = lpeg.P("a"), lpeg.P("ä"), lpeg.P("а")
--                                             ^ == u1072
context(lpeg.match(p1, s1)) --> 2, correct
context(lpeg.match(p2, s2)) --> 3, wrong
context(lpeg.match(p3, s3)) --> 3, wrong
\stopluacode
You'll see that lpeg isn't Unicode-aware. On the other hand, Roberto has a snippet on his page [1] that gets the Unicode number out of a UTF-8 octet sequence (up to 4 bytes), though I'm in no hurry to go this way: it would mean converting all the tables into integers, converting the input into an array of ints, then doing multi-char replacement (= integer substitution) on this array, and finally converting it back into a sequence of chars. I'm not sure the transliteration of a few single words is worth it.
Anyway, I'm glad you pointed me to the babel script, as I hadn't noticed it before.
Philipp
[1] http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html#ex
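The positions reported in the test above are byte positions: "ä" and the Cyrillic "а" each occupy two bytes in UTF-8, so a match ends at byte 3. The following sketch, in plain Lua, shows how a code point can be recovered from such a byte sequence. It is written for this illustration only; it is not the lpeg snippet from Roberto's page, the function names are made up, and it performs only minimal validation.

```lua
-- Length of a UTF-8 sequence, judged from its lead byte.
-- Returns nil for continuation bytes and invalid lead bytes.
local function utf8_length(lead)
  if lead < 0x80 then return 1
  elseif lead >= 0xC2 and lead < 0xE0 then return 2
  elseif lead >= 0xE0 and lead < 0xF0 then return 3
  elseif lead >= 0xF0 and lead < 0xF5 then return 4
  end
  return nil
end

-- Decode the character starting at byte i of s.
-- Returns the code point and the number of bytes consumed, or nil.
local function codepoint(s, i)
  i = i or 1
  local lead = s:byte(i)
  if not lead then return nil end
  local len = utf8_length(lead)
  if not len then return nil end
  if len == 1 then return lead, 1 end
  -- Strip the length marker bits from the lead byte.
  local offsets = { [2] = 0xC0, [3] = 0xE0, [4] = 0xF0 }
  local cp = lead - offsets[len]
  for k = 1, len - 1 do
    local b = s:byte(i + k)
    if not b or b < 0x80 or b > 0xBF then return nil end
    cp = cp * 64 + (b - 0x80)   -- each continuation byte carries 6 bits
  end
  return cp, len
end

print(codepoint("a"))         -- code point 97, length 1
print(codepoint("\195\164"))  -- "ä": code point 228, length 2
print(codepoint("\208\176"))  -- Cyrillic "а": code point 1072, length 2
```

The decimal escapes stand for the UTF-8 bytes of the two characters, so the example does not depend on the encoding of the source file.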
Early shots often go wrong; I take that back. Capturing a single multibyte character actually works if you know its UTF-8 length! I just have to write the parsers for the tables now.
Philipp
On 2010-03-08 <07:55:06>, Philipp Gesang wrote:
You'll see that lpeg isn't Unicode-aware.
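The idea in the follow-up above, consuming one character at a time once its UTF-8 length is known, can be sketched as follows. Note that this sketch uses plain Lua string functions instead of lpeg, that the table and function names are hypothetical, and that it assumes valid UTF-8 input and single-character keys only (no digraphs on the input side).

```lua
-- Hypothetical mini table (Cyrillic -> Latin), for illustration only.
local map = { ["ж"] = "ž", ["ч"] = "č", ["а"] = "a" }

-- Number of bytes in the character that starts with this lead byte.
-- Assumes the input is valid UTF-8, so no continuation byte is passed in.
local function char_length(lead)
  if lead < 0x80 then return 1
  elseif lead < 0xE0 then return 2
  elseif lead < 0xF0 then return 3
  else return 4
  end
end

-- Walk the string once, character by character, and look each one up.
local function transliterate(text)
  local out, i = {}, 1
  while i <= #text do
    local len = char_length(text:byte(i))
    local ch = text:sub(i, i + len - 1)
    out[#out + 1] = map[ch] or ch   -- unknown characters pass through
    i = i + len
  end
  return table.concat(out)
end

print(transliterate("жача bc")) --> žača bc
```

Compared with one gsub call per table entry, this makes a single pass over the input, which is also what a grammar-based solution would do.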
participants (5)
- Mojca Miklavec
- Philipp Gesang
- Taco Hoekwater
- Thomas A. Schmitz
- Wolfgang Schuster