transliteration russian
Hi all,
I am just about to typeset a book by a Russian author, written in English, but with a lot of Russian literature listed in the bibliography. The titles of these sources are Russian but in Latin transliteration, like this:
O koordinacii mezhdunarodnyh i vneshnejekonomicheskih svjazej subjektov Rossijskoj Federacii
But even though I assigned "\language[ru]", the word "vneshnejekonomicheskih", for example, does not get hyphenated. And there are some dozen more titles that show the same problem.
Is this (the missing hyphenation) because of the transliteration? Do I have to choose another \language key?
Yours, Steffen
On Oct 29, 2010, at 1:18 PM, Steffen Wolfrum wrote:
Of course. To the LuaTeX parser, the transliterated Russian is just gobbledygook; the hyphenation patterns expect proper Unicode input.
Thomas
On 10/29/2010 01:58 PM, Thomas A. Schmitz wrote:
On Oct 29, 2010, at 1:18 PM, Steffen Wolfrum wrote:
I would expect Slavic languages (cz, pl) to give better results in hyphenating this transliterated text, though they will not give perfect results and exceptions will be needed. I'm assuming a reader who expects Russian hyphenation rules in these cases.
Jano
On Fri, Oct 29, 2010 at 13:18, Steffen Wolfrum wrote:
Dear Steffen,
The Russian patterns only cover the Cyrillic script. The Serbian patterns are the only ones that cover both scripts, but even then TeX sees the patterns themselves as two different languages.
The best thing to do would be to transliterate the Russian patterns into Latin script (under one condition: the transliteration needs to be one-to-one; if one Cyrillic glyph transliterates into two Latin characters, that doesn't help you). If you use LuaTeX you may then load the patterns on the fly.
Another "easy" option would be to load any other Slavic patterns, as Jano suggested, and then add exceptions where needed.
I'm not sure whether transliterated patterns belong in hyph-utf8. (If nothing else, Russian is transliterated differently into Slovenian, for example, so one would formally then need "transliteration from Russian to any other given language written in Cyrillic script".)
[still under the assumption that you use LuaTeX and that the transliteration is one-to-one] By far the easiest and most portable solution would be if you could convince Taco to implement something like "Latin a is equivalent to Cyrillic a as far as hyphenation is concerned" (which could also solve many other problems that we have). Actually, you can already do that by redefining the \lccode of Latin a to point to Cyrillic a (and doing that for the whole alphabet), but then you need to make sure that you don't use any commands for lowercasing/uppercasing words.
If you need details, I can help you out, but first exact transliteration rules are needed.
Mojca
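The one-to-one condition above can be checked mechanically before anyone converts patterns. What follows is only a minimal sketch in Python: the table `TRANSLIT` is a hypothetical, deliberately incomplete excerpt modelled on the spellings in Steffen's example, not an official scheme, and the function name is made up for illustration.

```python
# Hypothetical excerpt of a transliteration table, modelled on the example
# title in this thread. It is neither complete nor an official scheme.
TRANSLIT = {
    "а": "a", "в": "v", "е": "e", "к": "k", "н": "n",
    "ж": "zh", "ч": "ch", "ш": "sh", "э": "je", "я": "ja",
}


def one_to_one_problems(table):
    """Return the reasons why `table` is not a one-to-one letter mapping."""
    problems = []
    seen = {}  # Latin image -> first Cyrillic letter that produced it
    for cyr, lat in table.items():
        if len(lat) != 1:
            problems.append(f"{cyr} -> {lat}: more than one Latin character")
        if lat in seen:
            problems.append(f"{cyr} and {seen[lat]} both map to {lat}")
        seen.setdefault(lat, cyr)
    return problems


if __name__ == "__main__":
    for line in one_to_one_problems(TRANSLIT):
        print(line)
```

For this excerpt the check reports the five digraph entries, which is exactly the situation in which simply relabelling letters (for instance via \lccode) cannot work.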
On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:
I was thinking: using \lccode for hyphenation is really a weird choice (I'm sure Don had a good reason back then, but such things are usually no longer relevant), and it is used in a sort of controlled environment (playing with \lccodes for hyphenation is not everyone's toy). So maybe LuaTeX can break backward compatibility in the hyphenation area and have a dedicated new code, \hycode or something, only for hyphenation purposes (backward compatibility could perhaps be kept by using it in addition to \lccode).
What do you think?
Regards, Khaled
--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
On 30-10-2010 12:05, Khaled Hosny wrote:
Just any letter (catcode letter) would do; the rest is to be controlled by the patterns.
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On 10/30/2010 10:17 AM, Hans Hagen wrote:
On 30-10-2010 12:05, Khaled Hosny wrote:
On Fri, Oct 29, 2010 at 11:25:20PM +0200, Mojca Miklavec wrote:
By far the easiest and most portable solution would be if you could convince Taco to implement something like "Latin a is equivalent to Cyrillic a as far as hyphenation is concerned"
You could try to convince me, but that would take considerable effort, because that is a form of cheating that I am not comfortable with. Besides, in the non-trivial cases a single Cyrillic letter maps to multiple Latin ones, and setting that up as an internal remapping is not trivial.
There is a simpler solution, I think: treat transliterations as a separate language on the macro side. Generating the patterns for that new language is simple if the transliteration rules are correct; just do the replacements like so: ‘я’->‘j8a’
Best wishes, Taco
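The replacement ‘я’->‘j8a’ can be read as: substitute the Latin spelling and put an even digit between its letters, since even digits in TeX patterns forbid a break at that position. A rough Python sketch of that idea follows; the table `TRANSLIT`, the function names and the sample patterns used with it are assumptions made up for illustration and are not taken from the real Russian pattern file.

```python
# Hypothetical excerpt of a Cyrillic-to-Latin table (an assumption, not an
# official scheme). A real conversion needs the complete transliteration rules.
TRANSLIT = {"а": "a", "е": "e", "к": "k", "н": "n", "ш": "sh", "я": "ja"}


def latin_pattern_piece(cyrillic_letter, table=TRANSLIT):
    """Return the pattern fragment for one letter, e.g. 'я' -> 'j8a'.

    The "8" between the letters of a multi-letter replacement is an even
    pattern digit, so the hyphenator never splits the digraph.
    """
    return "8".join(table[cyrillic_letter])


def transliterate_pattern(pattern, table=TRANSLIT):
    """Transliterate one pattern; characters without a table entry
    (digits, dots, unmapped letters) are copied unchanged."""
    return "".join(
        latin_pattern_piece(ch, table) if ch in table else ch
        for ch in pattern
    )


if __name__ == "__main__":
    # Invented sample patterns, only to show the shape of the output.
    for sample in ["я", ".ша1", "1на"]:
        print(sample, "->", transliterate_pattern(sample))
```

The resulting patterns would then be registered as a separate language on the macro side, as suggested above; how they are loaded is outside the scope of this sketch.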
On Sat, Oct 30, 2010 at 10:17:11AM +0200, Hans Hagen wrote:
just any letter (catcode letter) would do and the rest is to be controlled by the patterns
The issue here is that we want to make some characters equivalent to each other, e.g. ' and ’, which are needed for some languages, without having to duplicate the patterns.
Regards, Khaled
Hi! On 10/30/2010 11:34 AM, Khaled Hosny wrote:
Before diving too deep into the subject, consider whether it is really worth the effort. There is not much more than titles written in the transliterated text. No continuous reading.
My experience says that, whatever the language of the original title, the reader usually expects hyphenation similar to that of the language of the main text. Whenever I've used English patterns in English titles (even citations), they were changed by the Czech proofreader, though they were perfectly correct in English, to resemble Czech patterns. I'm not saying it is the right approach, but from the readers' and proofreaders' point of view, if someone reads in Czech and doesn't know English patterns or even English, patterns different from Czech are disturbing.
Jano
On Sun, Oct 31, 2010 at 07:12:20PM +0100, Jano Kula wrote:
Before diving too deep into the subject, consider whether it is really worth the effort. There is not much more than titles written in the transliterated text. No continuous reading.
It's not about the problem in this thread specifically, but rather another issue that was brought up recently on the XeTeX mailing list; basically, if one is using the curly apostrophe (’), all hyphenation patterns that depend on the ASCII one (') will not be taken into account.
Regards, Khaled
On 10/29/2010 11:25 PM, Mojca Miklavec wrote:
The best thing to do would be to transliterate Russian patterns into Latin script (under one condition: transliteration needs to be one-to-one; if one Cyrillic glyph transliterates into two Latin characters, that doesn't help you). If you use LuaTeX you may then load the patterns on the fly.
Warning: the transliteration used in Steffen's document (or at least in the example) is lossy and as such will likely produce wrong hyphenation, no matter which method is used to make TeX hyphenate it.
The transliteration (in the example) is also inconsistent: if you tried to reverse-transliterate it to Cyrillic, you would not only miss some characters, but you would also get some other characters wrong. Examples:
- 'subjektov' is 'субъектов',
- 'vneshnejekonomicheskih' is 'внешнеэкономических',
thus 'je' stands both for 'ъе' and for 'э'. This, however, could be just the author's typo. In that case 'subjektov' should be corrected to 'sub"ektov'.
The way to achieve an unambiguous (one-to-one) transliteration would be first to reverse-transliterate it to Cyrillic, and then to transliterate back to Latin using the ISO 9 transliteration standard: http://en.wikipedia.org/wiki/ISO_9
The example 'О координации международных и внешнеэкономических связей субъектов Российской Федерации' would then come out as 'O koordinacii meždunarodnyh i vnešneèkonomičeskih svâzej sub"ektov Rossijskoj Federacii'. This, however, I wouldn't consider very human-readable output.
A very handy tool for experiments can be found here: http://translit.cc/
On the margin: wouldn't it be much better to just use Cyrillic for that?
-- Andrzej Orłowski-Skoczyk
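The round trip described above can be tried out on the example title. The following Python sketch uses only the subset of the ISO 9 table that the title needs, and it follows the message in writing the hard sign as an ASCII double quote; the names `ISO9`, `to_latin` and `to_cyrillic` are hypothetical, chosen for this illustration.

```python
import unicodedata

# Subset of the ISO 9 table: only the letters occurring in the example title.
# The hard sign is written as ASCII '"', following the message above.
ISO9 = {
    "а": "a", "б": "b", "в": "v", "д": "d", "е": "e", "ж": "ž", "з": "z",
    "и": "i", "й": "j", "к": "k", "м": "m", "н": "n", "о": "o", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "ш": "š", "ъ": '"', "ы": "y", "э": "è", "я": "â",
}
# Every Latin image is a distinct single character, so the table can be inverted.
REVERSE = {lat: cyr for cyr, lat in ISO9.items()}


def _convert(text, table):
    """Map character by character, preserving case; unknown characters pass through."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        low = ch.lower()
        mapped = table.get(low, ch)
        out.append(mapped.upper() if ch != low else mapped)
    return "".join(out)


def to_latin(text):
    return _convert(text, ISO9)


def to_cyrillic(text):
    return _convert(text, REVERSE)


if __name__ == "__main__":
    title = ("О координации международных и внешнеэкономических "
             "связей субъектов Российской Федерации")
    latin = to_latin(title)
    print(latin)
    print(to_cyrillic(latin) == title)
```

Because each Cyrillic letter has exactly one Latin image here, going to Latin and back reproduces the Cyrillic title, which is not possible with the 'je' spelling discussed above.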
2010/10/30 Andrzej Orłowski-Skoczyk wrote:
Warning: the transliteration used in Steffen's document is (or at least the example is) lossy and as such will likely produce wrong hyphenation output no matter the applied method of making TeX hyphenate it.
I didn't inspect the transliteration, but now that you point it out: true, to achieve perfect results, one would need to completely redesign the patterns ... or simply use a random Slavic language and fix the wrong hyphenations one by one (in particular, words with sh/ch could easily break even though they represent a single letter).
The example 'О координации международных и внешнеэкономических связей субъектов Российской Федерации' would then output 'O koordinacii meždunarodnyh i vnešneèkonomičeskih svâzej sub"ektov Rossijskoj Federacii'. This however I wouldn't consider a very human-readable output.
... it depends on who the human is. Slavic-speaking countries have no problem pronouncing čšž ... :) :) :) The quotation marks are a bit weird though ...
Maybe the most sensible solution (assuming LuaTeX), one that would work perfectly but would not be easy to write, would be to input the title in Cyrillic script, let TeX hyphenate it, and finally output the automatically transliterated string.
Mojca
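The idea of hyphenating in Cyrillic and transliterating afterwards can be illustrated on plain strings. This is only a sketch in Python, not LuaTeX code: in LuaTeX the same step would presumably operate on the node list after hyphenation. The table `TRANSLIT` is a hypothetical excerpt covering one lowercase word, and the break points in the sample word are placed by hand for illustration, not produced by the Russian patterns.

```python
SOFT_HYPHEN = "\u00ad"  # marks a permitted break point inside a word

# Hypothetical excerpt of the scheme seen in the example title; it covers
# just the lowercase letters of one word.
TRANSLIT = {
    "в": "v", "е": "e", "и": "i", "к": "k", "м": "m", "н": "n",
    "о": "o", "с": "s", "х": "h", "ч": "ch", "ш": "sh", "э": "je",
}


def transliterate_keeping_breaks(word, table=TRANSLIT):
    """Transliterate letter by letter; soft hyphens pass through untouched.

    A multi-letter replacement such as 'sh' is emitted as one unit, so no
    break point can ever end up inside it.
    """
    return "".join(table.get(ch, ch) for ch in word)


if __name__ == "__main__":
    # Break points inserted by hand, standing in for the hyphenator's output.
    hyphenated = SOFT_HYPHEN.join(["внеш", "не", "эко", "но", "ми", "че", "ских"])
    latin = transliterate_keeping_breaks(hyphenated)
    print(latin.replace(SOFT_HYPHEN, "-"))
```

The point of the ordering is that the digraph problem disappears: break points are decided on single Cyrillic letters, so it no longer matters whether the Latin output is one-to-one.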
On 30.10.2010, at 00:15, Andrzej Orłowski-Skoczyk wrote:
Warning: the transliteration used in Steffen's document is (or at least the example is) lossy and as such will likely produce wrong hyphenation output no matter the applied method of making TeX hyphenate it.
The transliteration (in the example) is also inconsistent - if you tried to reverse transliterate it to Cyrillic, you would not only miss some characters, but you would also get some other characters wrong.
Andrzej, thanks for your statement! Thus I will leave it to the author to mark the appropriate break points when reading the first proof. After all, it's her text.
I think the results would not be in proportion to the effort if we tried to work out a general solution on the ConTeXt/LuaTeX side. At least not for this specific project.
My question starting this thread was made under the assumption of a good transliteration ...
Thank you all for your very interesting hints and notes!
Steffen
On 2010-10-29 <23:25:20>, Mojca Miklavec wrote:
The best thing to do would be to transliterate Russian patterns into Latin script (under one condition: transliteration needs to be one-to-one; if one cyrillic glyph transliterates into two latin
The one in question is a transcription (‘romanization’) rather than a transliteration, thus unfortunately there is no bijective mapping (e.g. ‘я’->‘ja’, ‘ш’->‘sh’ etc.). It seems to be a hybrid between the standard Library of Congress-style transcription and an older ISO or ГОСТ transliteration. Also, ‘j’ occurs in very odd positions. Whatever it is, we would need the complete transcription mapping.
As others already pointed out, with a small number of strings Steffen might get acceptable results by using the patterns of a similar language. Although real transliterations work best with Czech or Slovak, this peculiar transcription might be better off with Polish or even (judging by the use of ‘sh’) standard English.
@Steffen, if you could convince the author to supply the original Russian text and if he would agree to use a more common style, you could let the transliteration module do the job instead (http://bitbucket.org/phg/transliterator).
By far the easiest and most portable solution would be if you could convince Taco to implement something like "latin a is equivalent to cyrillic a as far as hyphenation is concerned" (which could also solve many other problems that we have).
+1. This would be a great feature. Good night all, Philipp
Mojca ___________________________________________________________________________________ If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context webpage : http://www.pragma-ade.nl / http://tex.aanhet.net archive : http://foundry.supelec.fr/projects/contextrev/ wiki : http://contextgarden.net ___________________________________________________________________________________
-- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
On 10/30/2010 12:47 AM, Philipp Gesang wrote:
As others already pointed out, with a small number of strings Steffen might get acceptable results by using the patterns of a similar language. Although real transliterations work best with Czech or Slovak, this peculiar transcription might be better off with Polish or even (judging by the use of ‘sh’) standard English.
I'm afraid Polish will not do (Polish always hyphenates sz-cz, though in Russian shch is one character, and so on). I'm afraid no Slavic language will do unless there is one that uses Latin script _and_ a soft/hard sign (yer); these are tricky, not similar to anything you meet in Polish, Czech and so on.
-- Andrzej Orłowski-Skoczyk
On 2010-10-30 <01:06:33>, Andrzej Orłowski-Skoczyk wrote:
I'm afraid Polish will not do (Polish always hyphenates sz-cz, though in Russian shch is one character; and such).
Of course, your point is clear. Still, I think Polish would be of more use than Czech in this case because it shares more similarities with the transcribed Russian. E.g. Russian and Polish have ‘ks’ where Czech has ‘x’; both Russian and Polish allow ‘ki’ and ‘gi’, which are illegal in Czech; and Czech lacks a native ‘g’, while the others have kept it. Thus you can hope for more valid hyphenation points if you use the Polish patterns, can’t you?
I'm afraid no Slavic language will do unless there is one that uses Latin script _and_ a soft/hard sign (yer); these are tricky, not similar to anything you meet in Polish, Czech and so on.
None of them is perfect, but most cases don’t require perfection. Trans[cription|literation] rarely occurs en masse, so often I just insert the break points by hand and forget about it.
Regards, Philipp
participants (9)
- Andrzej Orłowski-Skoczyk
- Hans Hagen
- Jano Kula
- Khaled Hosny
- Mojca Miklavec
- Philipp Gesang
- Steffen Wolfrum
- Taco Hoekwater
- Thomas A. Schmitz