A few weeks ago, I looked at Context because I wanted utf-8 hyphenation patterns for ancient Greek, but I found that the patterns shipped with Context have serious bugs. I had hoped to patch ctxtools, but the required changes went beyond my knowledge of Ruby.
I recently posted a Perl script to the xetex mailing list that should perform the conversion to utf-8 correctly. I would be happy to modify the script to make the output more useful to Context users, but I don't use Context myself. Feedback is welcome.
The essential problem with the patterns shipped with Context is that they are the result of a simple conversion, but the hyphenation rules in Greek are based on the definition of vowels and consonants, which changes in utf-8. The original 8-bit patterns of Dimitrios Filippou depend on the fact that in the Babel encoding accents come before the vowel (except for iota subscript), whereas in Unicode they are either combined with the vowel or come after it, depending on whether you use precomposed characters or not.
-- Peter Heslin (http://www.dur.ac.uk/p.j.heslin)
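The ordering difference described above can be illustrated with a minimal Ruby sketch (Ruby being the language of ctxtools). The two mapping tables and the helper `lgr_pair_to_unicode` are hypothetical and cover only two vowels and two accents; they assume the usual LGR transliteration in which the accent marker is typed before the vowel, and they are not taken from elhyph-utf8 or ctxtools.

```ruby
# Hypothetical, deliberately tiny mapping tables; the real LGR encoding
# covers many more letters and diacritics.
LGR_VOWELS  = { "a" => "\u03B1", "e" => "\u03B5" }.freeze # alpha, epsilon
LGR_ACCENTS = { "'" => "\u0301", "`" => "\u0300" }.freeze # combining acute, grave

# Convert one LGR accent+vowel pair to Unicode.
# In LGR the accent comes first; in decomposed Unicode the combining
# mark follows the vowel, and in precomposed Unicode both merge into
# a single code point.
def lgr_pair_to_unicode(pair, form: :nfc)
  accent, vowel = pair.chars
  (LGR_VOWELS.fetch(vowel) + LGR_ACCENTS.fetch(accent)).unicode_normalize(form)
end

precomposed = lgr_pair_to_unicode("'a")             # single code point
decomposed  = lgr_pair_to_unicode("'a", form: :nfd) # vowel, then combining mark

puts precomposed.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts decomposed.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

A pattern written for the accent-first order therefore cannot be converted character by character: the accent either disappears into the vowel or moves to the other side of it, which changes what counts as a vowel in the pattern.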
Hi Peter
I recently posted a Perl script to the xetex mailing list that should perform the conversion to utf-8 correctly. I would be happy to modify the script to make the output more useful to Context users, but I don't use Context myself. Feedback is welcome.
i leave that to the ones using greek ... we only need the conversion rules; adding them to the relevant section of ctxtools is then no big deal
The essential problem with the patterns shipped with Context is that it is the result of a simple conversion, but the hyphenation rules in Greek are based on the definition of vowels and consonants, which changes in utf-8. The original 8-bit patterns of Dimitrios Filippou depend on the fact that in the Babel encoding accents come before the vowel (except for iota subscript), whereas in Unicode they are either combined with the vowel or come after it, depending on whether you use precomposed characters or not.
hm, so those original patterns were latex dependent ... even more reason to ship patterns with context; of course bugs need to be fixed (or, if i understand, extended with the additional combinations)
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On Jun 30, 2006, at 12:06 AM, Hans Hagen wrote:
Hi Peter
I recently posted a Perl script to the xetex mailing list that should perform the conversion to utf-8 correctly. I would be happy to modify the script to make the output more useful to Context users, but I don't use Context myself. Feedback is welcome.
i leave that to the ones using greek ... we only need the conversion rules; adding them to the relevant section of ctxtools is then no big deal
The essential problem with the patterns shipped with Context is that it is the result of a simple conversion, but the hyphenation rules in Greek are based on the definition of vowels and consonants, which changes in utf-8. The original 8-bit patterns of Dimitrios Filippou depend on the fact that in the Babel encoding accents come before the vowel (except for iota subscript), whereas in Unicode they are either combined with the vowel or come after it, depending on whether you use precomposed characters or not.
hm, so those original patterns were latex dependent ... even more reason to ship patterns with context; of course bugs need to be fixed (or, if i understand, extended with the additional combinations)
Hans
Peter, Hans,
thanks for looking into this. I had realized something was fishy with the ConTeXt converted patterns, so I'd be extremely grateful if we could have a corrected version.
Hans: do we need the actual conversion rules, or would it be enough if Peter or I included the actual patterns into lang-agr.hyp? That may be faster and less work for you.
Best
Thomas
Thomas A. Schmitz wrote:
thanks for looking into this. I had realized something was fishy with the ConTeXt converted patterns, so I'd be extremely grateful if we could have a corrected version. Hans: do we need the actual conversion rules, or would it be enough if Peter or I included the
i prefer the rules, so if you can sort that out with peter
actual patterns into lang-agr.hyp? That may be faster and less work for you.
since there is no infrastructure for patterns, and since i want to be independent of anything happening in that area (keep in mind that we've been bitten by that too often: renaming, disappearing, funny internals, latex specific, limited encodings, etc) it's easier for me to occasionally run ctxtools on the originals and maintain that than to keep track of files
Hans
Hans Hagen writes:
i prefer the rules, so if you can sort that out with peter
In that case, you can examine the internals of my Perl script elhyph-utf8 and translate its logic to Ruby in ctxtools. But that is a non-trivial effort, and I cannot do it. A better alternative may be to have ctxtools simply call elhyph-utf8 as an external script. Does Context still have a dependency on Perl? If so, it would be much easier just to call the Perl script. I would be happy to ensure that elhyph-utf8 remains format-neutral. [A footnote: the original patterns are not Latex-specific, as you said, but are specific to the LGR encoding, which Latex Babel happens to use; but that Greek encoding is older than Babel, I think, and is also used elsewhere in the TeX world.]
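The "call it as an external script" option could look roughly like the following Ruby sketch. It is only an illustration under stated assumptions: `run_converter` is a hypothetical helper, not part of ctxtools, and the thread does not say how elhyph-utf8 is invoked or where it writes its output, so the command line in the closing comment is a placeholder.

```ruby
require "open3"

# Run an external converter and return what it wrote to standard output.
# `command` is an argument array, so no shell quoting is involved.
# Raises if the external program exits with a non-zero status.
def run_converter(command)
  stdout, stderr, status = Open3.capture3(*command)
  raise "converter failed: #{stderr}" unless status.success?
  stdout
end

# Hypothetical usage (not executed here; the real invocation of
# elhyph-utf8 is an assumption):
#   patterns = run_converter(["perl", "elhyph-utf8"])
```

The trade-off is the one raised in the message: this keeps the conversion logic in one place upstream, at the cost of a run-time dependency on Perl.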
since there is no infrastructure for patterns, and since i want to independent of anything happening in that area (keep in mind that we've been bitten by that too often: renaming, disappearing, funny internals, latex specific, limited encodings, etc)
I can appreciate your pain, but I'm sure that you are aware that there is also a danger in having Context fork its own patterns: that you may introduce bugs (as happened in this case), or that you may not pick up on upstream bug-fixes. Jonathan Kew has suggested that it might be desirable to have a set of general-purpose utf-8 hyphenation patterns in the texmf tree, which could be used by various TeX applications. From your comments it is clear that, in order for the Context community to buy into such a scheme, it would be necessary for this collection of patterns to be managed carefully, by consensus, and in a format-neutral manner, with good advance communication of any changes. If this were to happen, the advantage for Context is that the dangers I mentioned above could be minimized. But it is up to you to balance the potential risks and benefits for Context. -- Peter Heslin (http://www.dur.ac.uk/p.j.heslin)
Peter Heslin wrote:
Hans Hagen writes:
i prefer the rules, so if you can sort that out with peter
In that case, you can examine the internals of my Perl script elhyph-utf8 and translate its logic to Ruby in ctxtools. But that is a non-trivial effort, and I cannot do it. A better alternative may be to have ctxtools simply call elhyph-utf8 as an external script. Does Context still have a dependency on Perl? If so, it would be much easier just to call the Perl script. I would be happy to ensure that elhyph-utf8 remains format-neutral.
let's first look at the logic, i'm sure that Thomas can extend the conversion then (after all, it's logic -) the dependency on perl is mostly gone and will be completely gone in the future
I can appreciate your pain, but I'm sure that you are aware that there is also a danger in having Context fork its own patterns: that you may introduce bugs (as happened in this case), or that you may not pick up on upstream bug-fixes. Jonathan Kew has suggested that it might be
sure, but i've been bitten too often; context nowadays comes with a truckload of tools and methods, and if we had to adapt to something else every time latex is ready for it, we would quickly become unproductive; keep in mind that in that case we would not only have to deal with you, but also with another 20 pattern people; now we can just pick up and rearrange the bits and pieces (a similar thing happens with fonts: context had built-in map file support before things like updmap (useless for context anyway) came around, so adapting to yet another method was counterproductive; so context has its own encoding naming scheme, if only because the number of metrics that really ship is not that large) [another nice example: context supported lm fonts right from the start, and in the end changes in the names of map files took place because of other packages' needs; so again we are forced to ship our own stuff]
desirable to have a set of general-purpose utf-8 hyphenation patterns in the texmf tree, which could be used by various TeX applications. From
take the names alone ... every package has different preferences; for years i *did* use the (hardly) generic patterns, and each year something else was broken; context is used in production environments and we need stability in those areas
your comments it is clear that, in order for the Context community to buy into such a scheme, it would be necessary for this collection of patterns to be managed carefully, by consensus, and in a format-neutral
sure, that's the ideal world, but 25 years have taught me that this is near to impossible; actually i tried to start such an effort, starting with the names, but i gave up on it simply because i foresaw a waste of time
btw, already quite some years ago i published a method for encoding neutral patterns, but i never got any response on that: http://www.pragma-ade.com/general/manuals/mpattern.pdf, also published in tugboat (and i did some presentations about it)
manner, with good advance communication of any changes. If this were to happen, the advantage for Context is that the dangers I mentioned above could be minimized. But it is up to you to balance the potential risks and benefits for Context.
we will gladly use your stuff, but quite probably package it in the context way (maybe ctxtools will simply copy the existing utf ones, repackaged in a context way); btw, context uses utf patterns also in non utf mode, i.e. in pdftex etc
Hans
participants (4)
- Hans Hagen
- Peter Heslin
- Peter Heslin
- Thomas A. Schmitz