Christopher Creutzig wrote:
We already have Iconv in Ruby and can, if we know that ISO-8859-2 is a single-byte coding system, simply say
  conv = Iconv.new("UTF-16", "ISO-8859-2")
  256.times { |i| puts lookup[conv.iconv("%c" % i)] }
to get the whole list, assuming we've filled the lookup hash first.
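For readers on a current Ruby: Iconv was removed from the standard library in Ruby 2.0, so the same idea can be sketched with String#encode. This is only an illustration under that assumption, not the script discussed in this thread; the method name byte_to_unicode_table is made up for the example.

```ruby
# Sketch: map every byte value of a single-byte encoding to its Unicode
# code point. The helper name is hypothetical, not part of any ConTeXt tool.
def byte_to_unicode_table(encoding_name)
  (0...256).map do |byte|
    begin
      # Build a one-byte string, label it with the source encoding,
      # then transcode it to UTF-8 and read off the code point.
      byte.chr(Encoding::BINARY)
          .force_encoding(encoding_name)
          .encode("UTF-8")
          .ord
    rescue EncodingError
      nil # byte has no defined mapping in this encoding
    end
  end
end

table = byte_to_unicode_table("ISO-8859-2")
printf("0x%02X -> U+%04X\n", 0xA1, table[0xA1])
```

The resulting array is indexed by byte value, so a lookup hash keyed by code point can be applied to each entry afterwards.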
Great! Sorry for all my philosophising! I don't know Ruby (yet), and this possibility didn't even occur to me. My last idea was to parse and combine the data from http://www.unicode.org/Public/MAPPINGS/VENDORS/, http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt, but your idea is a hundred times faster and better! Thanks a lot!
As you've said, I'd combine steps A2 and A3, to make ConTeXt run faster.
That's fine with me. If there's a simple internal Ruby tool (run whenever the Unicode->TeX mapping changes or support for another encoding is added) instead of a one-time script, there should be no problem doing that directly.
If you want, for whatever reason, to use \textellipsis for an ellipsis (it just looks horribly wrong to me) instead of \dots, you'd need to invoke the ruby script which generates the regi-* files.
I just wanted to give an example showing that changes are sometimes needed and that it is difficult to track down all the places where they have to be made. Sorry, the example wasn't very illustrative; I don't even know what \textellipsis stands for. I just saw some comments about changes made in the regi-* files, or about some discrepancies.
The whole thing should not require any change at all to ConTeXt itself, since the regi-* files could look exactly as they do now, just being generated automatically. (For the multibyte encodings, the whole thing gets much more tricky.)
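As a rough sketch of what "generated automatically" could mean for a single-byte regime: keep one Unicode->TeX table and derive the per-encoding lines from it. Everything here is an assumption made for illustration. The table UNICODE_TO_TEX, the method regime_lines and the "byte command" line format are hypothetical and do not reproduce the actual regi-* syntax.

```ruby
# Hypothetical Unicode -> TeX command table; the entries are examples only.
UNICODE_TO_TEX = {
  0x0104 => "\\Aogonek",
  0x0105 => "\\aogonek",
  0x2026 => "\\dots" # change to "\\textellipsis" here, then regenerate
}

# Returns one "byte command" line for every upper-half byte of the given
# single-byte encoding that has an entry in the table. The line format is
# a placeholder, not the real regi-* file format.
def regime_lines(encoding_name, unicode_to_tex)
  (128...256).each_with_object([]) do |byte, lines|
    begin
      code_point = byte.chr(Encoding::BINARY)
                       .force_encoding(encoding_name)
                       .encode("UTF-8")
                       .ord
    rescue EncodingError
      next # byte is undefined in this encoding
    end
    command = unicode_to_tex[code_point]
    lines << "#{byte} #{command}" if command
  end
end

lines = regime_lines("ISO-8859-2", UNICODE_TO_TEX)
puts lines
```

With this layout, a change such as \dots versus \textellipsis is made once in the table, and every regime is regenerated from it.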
I noticed (perhaps I'm wrong) that the TeX community's support for Cyrillic may be better than that in Unicode and in the available old 8-bit encodings. ConTeXt also already supports those strange regimes (ctt, dbk, mls, mnk, mos, ncc, ...) that I was unable to find anywhere else. In this case one should also be careful not to spoil this already available feature.

I'm still slightly confused by the encoding files (texnansi, ec, ...; in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once?

Mojca