Hi,

I was thinking about some UTF-8 issues and where UTF-8 might crop up. Please note that these are just musings at the moment and not a proposal. I am trying to get a handle on the situation and on what kind of code would lead to what kind of complexity. Also, some ideas in here depend on a separate patch I am working on for PDFTeX (not yet completed). So this is _not_ intended as a work proposal or anything like that. It is just asking whether I have forgotten some aspects. It might lead to some proposals at a later point in time, but currently I am just trying to wrap my head around some things.

One thing I was thinking about was whether it might be a useful idea to make "char" a UTF-8 entity (consisting of 4 bytes, padded either with 0 or with ff), so that ord and chr are no longer trivial operations (a small sketch of what these conversions amount to at the byte level follows further down). A "packed array of char" would then also want specific accessor functions. It would also be possible to represent unpacked arrays of char (and single characters) as UTF-32, but have packed arrays coded in UTF-8.

Basically, the idea would be to move most of the UTF-8/Unicode issues into web2c, and then use something like iterators (defined in web2c with a suitable syntax) for accessing the input buffer. However, every place that currently has to deal explicitly with UTF-8 (as compared to 8-bit TeX) would _still_ need to deal with it explicitly and would merely get different (and possibly more compact and natural) ways of expressing these conversions. So the benefit in terms of "code pieces to touch" would not necessarily be impressive.

So to get a better feeling for whether messing with web2c might be worth the trouble, I tried to come up with an estimate of where UTF-8 is actually used and/or useful, and in what ways.

One place obviously is the input buffer. I am trying to find the time to work on the "partial line buffer" patch I talked about, where the necessity to read a whole line at a time disappears. _If_ such a patch finds its way into PDFTeX, the space savings of UTF-8 over UTF-32 become irrelevant in the input buffer, and UTF-32 would have the advantage of letting one more or less keep the code as it is in PDFTeX. Without such a patch, however, the most compact input line buffer representation will appear desirable (as recent discussions about further extending the buffer size clearly indicate).

So what about the further workflow? From the input buffer, we have basically two uses. One is getting single characters (as in character nodes and similar); this would seem to favor UTF-32 as most efficient, and it is probably the most frequent operation all in all. The other is looking up control sequence names. Those are stored in the string pool, and I think UTF-8 makes perfect sense there (mostly ASCII, possibly large amounts of data).

Currently TeX uses the input buffer for interpreting control sequence names, either directly from \... or from \csname ...\endcsname, so it appears necessary to put the input buffer into UTF-8 if that is the coding in the string pool we want to compare with. However, one has to be aware that \csname ...\endcsname works from tokens, so its content is basically UTF-32 at the start. Now \... is certainly much more common and important.

I might, however, decide for the "partial line buffer" patch to use a _separate_ buffer for assembling control sequence names. Part of the reason is that TeX currently rearranges its input line when processing things like \r^^65lax, with the result that error messages output lines that never occurred in the source code.
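Here is the promised sketch of what the non-trivial chr/ord operations would amount to. This is purely illustrative C, not taken from web2c or PDFTeX; the names u8_chr and u8_ord are made up. It just shows the conversion between a UTF-32 character code and its packed UTF-8 byte form:

    #include <stddef.h>
    #include <stdint.h>

    /* "chr": encode one code point (<= 0x10FFFF) as 1..4 UTF-8 bytes,
       returning the number of bytes written. */
    static size_t u8_chr(uint32_t c, unsigned char *out)
    {
        if (c < 0x80) {
            out[0] = (unsigned char) c;
            return 1;
        }
        if (c < 0x800) {
            out[0] = (unsigned char) (0xC0 | (c >> 6));
            out[1] = (unsigned char) (0x80 | (c & 0x3F));
            return 2;
        }
        if (c < 0x10000) {
            out[0] = (unsigned char) (0xE0 | (c >> 12));
            out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (c & 0x3F));
            return 3;
        }
        out[0] = (unsigned char) (0xF0 | (c >> 18));
        out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (c & 0x3F));
        return 4;
    }

    /* "ord": decode one code point from a valid UTF-8 sequence and
       report how many bytes it occupied (no validation attempted). */
    static uint32_t u8_ord(const unsigned char *p, size_t *len)
    {
        if (p[0] < 0x80) { *len = 1; return p[0]; }
        if (p[0] < 0xE0) { *len = 2;
            return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F); }
        if (p[0] < 0xF0) { *len = 3;
            return ((uint32_t)(p[0] & 0x0F) << 12)
                 | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F); }
        *len = 4;
        return ((uint32_t)(p[0] & 0x07) << 18)
             | ((uint32_t)(p[1] & 0x3F) << 12)
             | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }

Whatever syntax web2c might offer for "packed array of char" accessors, every comparison of token material against the UTF-8 string pool, and every extraction of a character code from UTF-8 material, pays for essentially this kind of conversion.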
Treating the input lines in the buffer not as read-only data and already occasionally doing what amounts to "recoding" is an "optimization" that is really messy and probably not worth the savings: control sequences, while occurring quite frequently in the input, will most likely be complex enough that the time needed for executing a control sequence is considerably larger than the time needed for scanning it, and copying names into a separate place before looking them up is not likely to incur much overhead.

So in the course of the partial line buffer patch for PDFTeX, I might introduce a separate place for assembling control sequence names (which in LuaTeX would then be the natural point for reconversion to UTF-8), and the necessity of having control sequences and the input buffer coded in the same manner might then mostly disappear. Whether one makes use of that depends on where one would prefer to have the complexity: in the buffer handling code (which is distributed all over TeX) or elsewhere. Not having to reconvert material from \csname...\endcsname, \scantokens and similar constructs to UTF-8 might also help keep the UTF-8 code complexity confined to a few places.

Where else is UTF-8 interesting? One _very_ interesting place for it is the hyphenation tables. It would seem to make a lot of sense to switch this sort of data structure to something UTF-8 based, as long as one makes sure that frequent prefix bytes (like C2) do not lead to inefficiency due to collisions. Since _legal_ UTF-8 sequences never allow a breakpoint _anywhere_ inside a character's byte sequence, going via UTF-8 seems like a perfectly usable idea as long as the input is valid UTF-8 (a tiny illustration of this boundary property follows at the end of this message). The disadvantage: since hyphenation works from char nodes, this again implies converting UTF-32 to UTF-8 before doing the hyphenation lookup.

The current Aleph-inherited hyphenation works only in the Unicode base plane, is 16-bit based, and is somewhat unstable (iniTeX will dump core when trying to hyphenate). Extending it to 21 bits would necessitate either making use of surrogate pairs in UTF-16 or going to UTF-8. Since the complexities are pretty much the same in either case, UTF-8 would appear to be the more compact solution.

One further consideration: \lefthyphenmin and \righthyphenmin will not be able to rely on the length of _byte_ strings. However, since they really should count grapheme clusters rather than single character codes (combining accents should not count as characters for \lefthyphenmin and \righthyphenmin, I think), one has to think about those separately anyway.

The hyphenation tables are special in that we would actually need to have UTF-8 exposed as a _byte_ sequence with values from 0..255. Most other uses could be implemented with some packed array semantics inside web2c, using chr/ord.

Did I forget places where UTF-8 might or might not crop up?
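The promised illustration of the boundary property (again purely illustrative C with made-up names, not taken from any existing source): in valid UTF-8 a continuation byte always has the form 10xxxxxx, so a byte-level match or break can only start and end on character boundaries, while byte counts are useless for \lefthyphenmin and \righthyphenmin:

    #include <stddef.h>

    /* A byte position i in valid UTF-8 is a character boundary exactly
       when buf[i] is not a continuation byte (10xxxxxx). */
    static int u8_is_boundary(const unsigned char *buf, size_t i)
    {
        return (buf[i] & 0xC0) != 0x80;
    }

    /* Counting code points rather than bytes -- at the very least what
       \lefthyphenmin and \righthyphenmin would have to do; counting
       grapheme clusters would require more than this. */
    static size_t u8_count(const unsigned char *buf, size_t nbytes)
    {
        size_t n = 0, i;
        for (i = 0; i < nbytes; i++)
            if ((buf[i] & 0xC0) != 0x80)
                n++;
        return n;
    }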
--
David Kastrup