Hi,

I was thinking about some UTF-8 issues and where UTF-8 might crop up. Please note that these are just musings at the moment and not a proposal. I am trying to get a handle on the situation and on what kind of code would lead to what kind of complexity. Also, some ideas in here depend on a separate patch I am working on for PDFTeX (not yet completed). So this is _not_ intended as a work proposal or anything like that. It is just asking whether I have forgotten some aspects. It might lead to some proposals at a later point in time, but currently I am just trying to wrap my head around some things.

One thing I was thinking about was whether it might be a useful idea to make "char" a UTF-8 entity (consisting of 4 bytes, padded either with 0 or with ff), so that ord and chr are no longer trivial operations (a small sketch of what these conversions amount to at the byte level follows further down). A "packed array of char" would then also want specific accessor functions. It would also be possible to represent unpacked arrays of char (and single characters) as UTF-32, but have packed arrays coded in UTF-8.

Basically, the idea would be to move most of the UTF-8/Unicode issues into web2c, and then use something like iterators (defined in web2c with a suitable syntax) for accessing the input buffer. However, every place that currently has to deal explicitly with UTF-8 (as compared to 8-bit TeX) would _still_ need to deal with it explicitly and would merely get different (and possibly more compact and natural) ways of expressing these conversions. So the benefit in terms of "code pieces to touch" would not necessarily be impressive.

So to get a better feeling for whether messing with web2c might be worth the trouble, I tried to come up with an estimate of where UTF-8 is actually used and/or useful, and in what ways.

One place obviously is the input buffer. I am trying to find the time to work on the "partial line buffer" patch I talked about, where the necessity to read a whole line at a time disappears. _If_ such a patch finds its way into PDFTeX, the space savings of UTF-8 over UTF-32 become irrelevant in the input buffer, and UTF-32 would have the advantage of letting one more or less keep the code as it is in PDFTeX. Without such a patch, however, the most compact input line buffer representation will appear desirable (as recent discussions about further extending the buffer size clearly indicate).

So what about the further workflow? From the input buffer, we have basically two uses. One is getting single characters (as in character nodes and similar); this would seem to favor UTF-32 as most efficient, and it is probably the most frequent operation all in all. The other is looking up control sequence names. Those are stored in the string pool, and I think UTF-8 makes perfect sense there (mostly ASCII, possibly large amounts of data).

Currently TeX uses the input buffer for interpreting control sequence names, either directly from \... or from \csname ...\endcsname, so it appears necessary to put the input buffer into UTF-8 if that is the coding in the string pool we want to compare with. However, one has to be aware that \csname ...\endcsname works from tokens, so its content is basically UTF-32 at the start. Now \... is certainly much more common and important.

I might, however, decide for the "partial line buffer" patch to use a _separate_ buffer for assembling control sequence names. Part of the reason is that TeX currently rearranges its input line when processing things like \r^^65lax, with the result that error messages output lines that never occurred in the source code.
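Here is the promised sketch of what the non-trivial chr/ord operations would amount to. This is purely illustrative C, not taken from web2c or PDFTeX; the names u8_chr and u8_ord are made up. It just shows the conversion between a UTF-32 character code and its packed UTF-8 byte form:

    #include <stddef.h>
    #include <stdint.h>

    /* "chr": encode one code point (<= 0x10FFFF) as 1..4 UTF-8 bytes,
       returning the number of bytes written. */
    static size_t u8_chr(uint32_t c, unsigned char *out)
    {
        if (c < 0x80) {
            out[0] = (unsigned char) c;
            return 1;
        }
        if (c < 0x800) {
            out[0] = (unsigned char) (0xC0 | (c >> 6));
            out[1] = (unsigned char) (0x80 | (c & 0x3F));
            return 2;
        }
        if (c < 0x10000) {
            out[0] = (unsigned char) (0xE0 | (c >> 12));
            out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (c & 0x3F));
            return 3;
        }
        out[0] = (unsigned char) (0xF0 | (c >> 18));
        out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (c & 0x3F));
        return 4;
    }

    /* "ord": decode one code point from a valid UTF-8 sequence and
       report how many bytes it occupied (no validation attempted). */
    static uint32_t u8_ord(const unsigned char *p, size_t *len)
    {
        if (p[0] < 0x80) { *len = 1; return p[0]; }
        if (p[0] < 0xE0) { *len = 2;
            return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F); }
        if (p[0] < 0xF0) { *len = 3;
            return ((uint32_t)(p[0] & 0x0F) << 12)
                 | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F); }
        *len = 4;
        return ((uint32_t)(p[0] & 0x07) << 18)
             | ((uint32_t)(p[1] & 0x3F) << 12)
             | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }

Whatever syntax web2c might offer for "packed array of char" accessors, every comparison of token material against the UTF-8 string pool, and every extraction of a character code from UTF-8 material, pays for essentially this kind of conversion.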
Treating the input lines in the buffer not as read-only data and already occasionally doing what amounts to "recoding" is an "optimization" that is really messy and probably not worth the savings: control sequences, while occurring quite frequently in the input, will most likely be complex enough that the time needed for executing a control sequence is considerably larger than the time needed for scanning it, and copying names into a separate place before looking them up is not likely to incur much overhead.

So in the course of the partial line buffer patch for PDFTeX, I might introduce a separate place for assembling control sequence names (which in LuaTeX would then be the natural point for reconversion to UTF-8), and the necessity of having control sequences and the input buffer coded in the same manner might then mostly disappear. Whether one makes use of that depends on where one would prefer to have the complexity: in the buffer handling code (which is distributed all over TeX) or elsewhere. Not having to reconvert material from \csname...\endcsname, \scantokens and similar constructs to UTF-8 might also help keep the UTF-8 code complexity confined to a few places.

Where else is UTF-8 interesting? One _very_ interesting place for it is the hyphenation tables. It would seem to make a lot of sense to switch this sort of data structure to something UTF-8 based, as long as one makes sure that frequent prefix bytes (like C2) do not lead to inefficiency due to collisions. Since _legal_ UTF-8 sequences never allow a breakpoint _anywhere_ inside a character's byte sequence, going via UTF-8 seems like a perfectly usable idea as long as the input is valid UTF-8 (a tiny illustration of this boundary property follows at the end of this message). The disadvantage: since hyphenation works from char nodes, this again implies converting UTF-32 to UTF-8 before doing the hyphenation lookup.

The current Aleph-inherited hyphenation works only in the Unicode base plane, is 16-bit based, and is somewhat unstable (iniTeX will dump core when trying to hyphenate). Extending it to 21 bits would necessitate either making use of surrogate pairs in UTF-16 or going to UTF-8. Since the complexities are pretty much the same in either case, UTF-8 would appear to be the more compact solution.

One further consideration: \lefthyphenmin and \righthyphenmin will not be able to rely on the length of _byte_ strings. However, since they really should count grapheme clusters rather than single character codes (combining accents should not count as characters for \lefthyphenmin and \righthyphenmin, I think), one has to think about those separately anyway.

The hyphenation tables are special in that we would actually need to have UTF-8 exposed as a _byte_ sequence with values from 0..255. Most other uses could be implemented with some packed array semantics inside web2c, using chr/ord.

Did I forget places where UTF-8 might or might not crop up?
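The promised illustration of the boundary property (again purely illustrative C with made-up names, not taken from any existing source): in valid UTF-8 a continuation byte always has the form 10xxxxxx, so a byte-level match or break can only start and end on character boundaries, while byte counts are useless for \lefthyphenmin and \righthyphenmin:

    #include <stddef.h>

    /* A byte position i in valid UTF-8 is a character boundary exactly
       when buf[i] is not a continuation byte (10xxxxxx). */
    static int u8_is_boundary(const unsigned char *buf, size_t i)
    {
        return (buf[i] & 0xC0) != 0x80;
    }

    /* Counting code points rather than bytes -- at the very least what
       \lefthyphenmin and \righthyphenmin would have to do; counting
       grapheme clusters would require more than this. */
    static size_t u8_count(const unsigned char *buf, size_t nbytes)
    {
        size_t n = 0, i;
        for (i = 0; i < nbytes; i++)
            if ((buf[i] & 0xC0) != 0x80)
                n++;
        return n;
    }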
--
David Kastrup