towards some more consistency in regimes & unicode support
Hello,

Sorry for a slightly longer mail. I wanted to send it to context-dev, but there is probably someone besides Adam out there who could contribute (for example by re-checking the Greek or Cyrillic sections of Unicode, or by adding some missing Hebrew definitions). If someone thinks it is more appropriate, please feel free to continue the discussion on context-dev.

I. In regi-utf it would be fine to add:

\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]

II. After a long time I finally decided to write my first Ruby script. I took UnicodeData.txt, the Adobe glyph list and enco-uc.tex, collected everything together, removed the characters >FFFF (in case someone needs them they can trivially be added again, but I don't think that anyone is planning to name them shortly), did some manual corrections ... and here are the results:

http://pub.mojca.org/tex/enco/contextlist/
http://pub.mojca.org/tex/enco/contextbase/regi-temp.tex

The idea behind it is that there is no "definitive reference" for the ConTeXt glyph names, which means that every new regime that should be supported needs a lot of manual work and leads to many inconsistencies.

The file contextnames.txt contains the Unicode hexadecimal number, the pdf name (from the Adobe Glyph List), the ConTeXt name and the Unicode name. This could then be a source of information when adding new regimes, writing unicode vectors (unic-*) and mapping to font encodings. Uppercasing/lowercasing information for font encodings and other files can now be derived directly from Unicode and this list (Unicode already contains information about upper/lowercase variants of the letters) ...

There is some more info missing, which should be either packed within the same file or in separate files:
- ConTeXt synonyms (like \Dcroat -> \Dstroke, ...)
- pdf synonyms (dbar -> dcroat), to help recognize the glyphs in .enc or .afm and automate support for it
- faking the characters (\ccaron -> \buildtextaccent\textcaron{c})
- unaccented version of the characters (\Aacute -> A, ...)
- other characters not present in Unicode (Caron, Acute - these are accents for uppercase letters, ...)
- (I'm sure that I wanted to add some more points, but I don't remember any other right now)

When I wanted to add the names from unic-34.tex, I realized that we don't really need to have a command for "every single unicode character" (we certainly don't need to map math characters into that region), but if someone already has a file with unicode integrals, it costs nothing to give him those characters in output. (In short: 0x2211, "N-ARY SUMMATION" should expand into $\sum$, but not the other way round.) I have to slightly change the syntax in the ConTeXt glyph names file to note this difference and to be able to define math (and other) signs properly.

------------------------------------------------------------------------

III. Now I need some help - someone should help me revise the file contextnames.txt (I prepared an HTML version of it): correct mistakes (if any are spotted), add new definitions, help to prepare a list of synonyms, a list of expansions (\buildtextaccent), ...

------------------------------------------------------------------------

Here are some points which I spotted, but can't fix alone:

1. Characters missing (needed by some regimes):
0020-007F section
037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS
2015 HORIZONTAL BAR
2017 DOUBLE LOW LINE
20AA NEW SHEQEL SIGN
20AB DONG SIGN
20AF DRACHMA SIGN
2116 NUMERO SIGN
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
1Exx section

2. Greek - there are some name inconsistencies when compared to the unic-031 vector, but I don't know anything about old Greek. I didn't check Cyrillic at all.

3. Punctuation and accents - mostly names for quotes and language dependency (lowerleftuppersixquote in comparison to lftdblquote ... or whatever they are called) (+ tricks, I already asked about quotes & hyphenation approximately a week ago). I have problems understanding the difference between letter modifiers (U+02Cx) and usual accents (U+00Bx). "Combining Diacritical Marks" (U+03xx) should be supported somehow as well. I have no idea how to make U+0065 U+0301 (e + combining acute accent) into eacute.

4. Should hungarumlaut be doubleacute and hungarumlaut only its synonym, or the other way round?

5. tbar vs. tstroke: compare 0166 and 023E.

6. The cedilla/commaaccent dilemma: there's a huge problem with "t with cedilla" (0162). "t with comma below" (021A) should be used instead (at least this is stated in the Unicode reference), but most regimes map a character to "t with cedilla" (0162), which seems stupid to me. The Adobe glyph list therefore uses tcommaaccent for "t with cedilla", which looks like "t with comma accent", but is in the wrong place. lmr has both tcommaaccent and tcedilla. \tcedilla should be "t with cedilla" in my opinion and \tcommaaccent "t with comma accent". That currently isn't the case in ConTeXt unless something has changed recently. There are many other letters wrongly named in Unicode ("with cedilla"), although they have a comma. I would suggest naming them \[gklnr]commaaccent and using \[gklnr]cedilla as a synonym (if needed at all for backward compatibility, otherwise it would be better to leave them out; there is no such letter with cedilla in Unicode, and if someone needs one, he can construct one trivially with \buildtextaccent).

7. There's "a-kind-of-bug-but-not-really-one" in enco-ans.tex. textcedilla maps to 184, which isn't defined in Antykwa for example (there it is in slot 24). It's more a "bug" in the texnansi encoding, which has the cedilla in two places, which is pretty stupid. But anyway:
\definecharacter textcedilla 24
would solve some problems (and hopefully not introduce new ones).

8. Most letters are named following the pattern "c with cedilla" -> ccedilla. What about the names for "open o", "turned e", "long s", "turned r with hook"? \openo or \oopen? \rturnedhook or \turnedrhook?

9. Can Latin letters and numbers be accessed somehow by name?

10. Adam prepared some dingbats support I think, this could be added here.

11. There's a showunicode pdf document on pragma-ade.com (at least I saw it once), but it's not listed on overview.htm.

12. I don't know if anyone would ever need to switch from the viscii regime to some other, but what would happen to the characters under 128 (some of them are redefined in viscii)? I'm afraid that Vietnamese leftovers would remain in the lower part of the table.

13. If there are any other comments on the table and/or the script(s), please let me know.

IV. With the help of the prepared names list I processed definitions for regimes (taken from the Unicode webpage) for ISO-8859-* and cp125* (others should be trivial). They are only preliminary, and some (Hebrew, Thai, Arabic) probably don't make any sense yet, but could the rest be added to ConTeXt after someone checks that everything is OK? (The iso88595, cp1251, il1, il2, il9, windows and viscii regimes already exist and should be compared for differences.) If possible, this should be done in such a way that it wouldn't be necessary to include the regime definition file manually: just as \usemodule[pre-polish] finds and processes the proper file, \enableregime[xxx] should find the proper file and load it.

(And for those who made it till here - sorry again for that gigantic mail.)

Mojca
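The merge step described in section II can be pictured roughly as follows. This is only a minimal sketch in Python, not the Ruby script from the mail: the function names, the inlined sample data, the `CONTEXT_NAMES` table and the output layout are all assumptions made for illustration, and the real contextnames.txt may be laid out differently. It relies only on the documented field order of UnicodeData.txt (code point in field 0, name in field 1, uppercase and lowercase mappings in fields 12 and 13) and on the `name;code` layout of the Adobe glyph list.

```python
# Sketch only: merge UnicodeData.txt-style records with glyph-list-style
# records into one row per code point. All sample data below is illustrative.

UNICODE_DATA = """\
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
0162;LATIN CAPITAL LETTER T WITH CEDILLA;Lu;0;L;0054 0327;;;;N;LATIN CAPITAL LETTER T CEDILLA;;;0163;
1D11E;MUSICAL SYMBOL G CLEF;So;0;L;;;;;N;;;;;
"""

GLYPH_LIST = """\
# comment lines start with '#'
Eacute;00C9
eacute;00E9
Tcommaaccent;0162
"""

# Hypothetical stand-in for the names collected from enco-uc.tex.
CONTEXT_NAMES = {0x00C9: "Eacute", 0x00E9: "eacute", 0x0162: "Tcedilla"}


def parse_unicode_data(text):
    """Return {codepoint: (name, uppercase or None, lowercase or None)}."""
    table = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # not a complete UnicodeData.txt record
        code = int(fields[0], 16)
        if code > 0xFFFF:
            continue  # drop characters above FFFF, as the mail describes
        upper = int(fields[12], 16) if fields[12] else None
        lower = int(fields[13], 16) if fields[13] else None
        table[code] = (fields[1], upper, lower)
    return table


def parse_glyph_list(text):
    """Return {codepoint: first glyph name listed for it}."""
    names = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, code = line.split(";")
        if " " in code:
            continue  # skip names that map to a sequence of code points
        names.setdefault(int(code, 16), name)
    return names


def merge(unicode_table, glyph_names, context_names):
    """Build rows of (hex code, pdf name, ConTeXt name, Unicode name)."""
    rows = []
    for code in sorted(unicode_table):
        name = unicode_table[code][0]
        rows.append((f"{code:04X}",
                     glyph_names.get(code, "-"),
                     context_names.get(code, "-"),
                     name))
    return rows


unicode_table = parse_unicode_data(UNICODE_DATA)
rows = merge(unicode_table, parse_glyph_list(GLYPH_LIST), CONTEXT_NAMES)
for row in rows:
    print("\t".join(row))
```

Keeping the case mappings in the parsed table is what would allow uppercasing/lowercasing information to be derived from the same source later; the 0162 row also shows the cedilla/commaaccent mismatch from point 6 side by side.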
Mojca,

I'm not sure I've understood all you're trying to do, but I feel kind of responsible for the Greek. I took the polytonic/ancient Greek basically from the Unicode names, but I left modern/monotonic Greek alone because the support was already there and I didn't want to mess up somebody else's work. As for the three slots you mention:

037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS

These are characters that are never (?) used on their own, only to combine with vowels. But let me know if there are more inconsistencies, and I'll try and fix them for the 31-vector.

Best

Thomas

On Sep 13, 2005, at 5:12 PM, Mojca Miklavec wrote:
2. Greek - there are some name inconsistencies when compared to the unic-031 vector, but I don't know anything about old Greek. I didn't check Cyrillic at all.
Thomas A. Schmitz wrote:
Mojca,
I'm not sure I've understood all you're trying to do, but I feel kind of responsible for the Greek.
Thank you very much, Thomas!
I took the polytonic/ancient Greek basically from the Unicode names, but I left modern/monotonic Greek alone because the support was already there and I didn't want to mess up somebody else's work. As for the three slots you mention:
037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS
These are characters that are never (?) used on their own, only to combine with vowels. But let me know if there are more inconsistencies, and I'll try and fix them for the 31-vector.
I would say that the same is true for the acute/grave/circumflex accents in Latin, but they're there and we need a name for them in order to be able to compose (fake) characters out of them (\buildtextaccent\textgrave{a} to get agrave). What do you do with those characters in the cp1253 encoding (http://www.microsoft.com/typography/unicode/1253.htm)? Without those definitions the cp1253 input encoding cannot be fully supported, but is anyone using that regime at all? cp1250 (central european) is still widely used for example.

For combining there are some others (unnamed):

0342 COMBINING GREEK PERISPOMENI
0343 COMBINING GREEK KORONIS
0344 COMBINING GREEK DIALYTIKA TONOS
0345 COMBINING GREEK YPOGEGRAMMENI

but they need special treatment (not supported in ConTeXt yet) anyway.

I know just about nothing about Greek fonts and their quality (coverage of Greek glyphs), but even with a pretty incomplete font you can then say something like:

\definecharacter greekomegatonos \buildtextaccent\greektonos\greekomega

and perhaps even

\definecharacter greektonos \textacute

where there is no special glyph for tonos present.

I guess that \greekypogegrammeni, \greektonos and \greekdialytikatonos would be just fine. I just asked because there may be some cases (like with many Latin "cedilla" or "stroke" letters, or "hacek" that was later renamed into "caron") where Unicode is not as accurate as one would want it to be.

An example of inconsistency of names:

1F0C \greekAlphapsilitonos GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA
1F0D \greekAlphadasiatonos GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA

But I don't know anything about Greek, so I cannot judge which of the names is more accurate.

Thanks again for help,
Mojca
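For cross-checking which base letter and which accents a precomposed character is built from (the information needed for "unaccented version" lists and for \buildtextaccent-style fallbacks), the canonical decompositions recorded in UnicodeData.txt can be queried directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `base_and_accents` and `compose` are made up for illustration, and nothing here reflects how ConTeXt itself handles combining marks.

```python
import unicodedata


def base_and_accents(char):
    """Split a precomposed character into (base, [combining marks]).

    NFD applies the canonical decomposition recursively, so a letter with
    two accents yields both marks.
    """
    decomposed = unicodedata.normalize("NFD", char)
    return decomposed[0], list(decomposed[1:])


def compose(base, *marks):
    """Recombine a base letter and combining marks where Unicode allows it."""
    return unicodedata.normalize("NFC", base + "".join(marks))


for char in ("\u00e9", "\u03ce", "\u1f0c"):
    base, marks = base_and_accents(char)
    mark_names = ", ".join(unicodedata.name(m) for m in marks)
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
    print(f"    base U+{ord(base):04X}, marks: {mark_names}")

# e + combining acute accent (U+0065 U+0301) composes to eacute (U+00E9).
print(f"U+{ord(compose('e', chr(0x0301))):04X}")
```

In the Unicode data, both the tonos on U+03CE and the oxia on U+1F0C decompose to the same COMBINING ACUTE ACCENT (U+0301), which is one way to see why the tonos/oxia naming of the ConTeXt commands can look inconsistent.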
participants (2)
- Mojca Miklavec
- Thomas A. Schmitz