[Dev-luatex] valid Unicode character treatment?

24 Apr 2007

Hi, I've been wondering about several things with regard to
Unicode/utf-8.

What is the situation in fonts (of any type) in general concerning the
non-existing Unicode codepoints corresponding to the 16-bit surrogate
codes D800-DFFF?  Are there any fonts actually putting stuff there, if
only ligatures?

A compliant utf-8 file is not supposed to contain any codes in that
area, so we would not want to have them appear as part of "overfull
hbox" messages and similar if I am not mistaken.
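
To make the point about the surrogate area concrete, here is a small
sketch in Python (used purely for illustration, it is not LuaTeX code).
It shows that a strict utf-8 decoder refuses the three-byte pattern that
would encode U+D800. The helper names is_surrogate and strict_decode_ok
are made up for this example.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the 16-bit surrogate area D800-DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def strict_decode_ok(data: bytes) -> bool:
    """True if Python's strict utf-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ED A0 80 is what U+D800 would look like if surrogates were
# allowed in utf-8; a strict decoder treats it as illegal input.
surrogate_bytes = bytes([0xED, 0xA0, 0x80])
legal_bytes = "\u00e9".encode("utf-8")  # C3 A9, an ordinary 2-byte character
```

Under these assumptions strict_decode_ok(surrogate_bytes) is false and
strict_decode_ok(legal_bytes) is true.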

I am currently thinking about a utf-8 strategy that would be least
prone to causing internal inconsistencies: basically I think that
certain properties of _legal_ utf-8 should be guaranteed inside of
LuaTeX, such as that the number of characters in a string equals
the number of bytes outside of the 80-BF code range, that characters
are encoded with minimal length, that the number of bytes never
exceeds 4 times the number of characters, and similar things.
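
The properties listed above can be checked mechanically.  The following
is a rough Python sketch (again only an illustration, not a proposal for
the actual implementation); count_chars and check_invariants are
hypothetical names.

```python
def count_chars(data: bytes) -> int:
    """Count the bytes outside the 80-BF range.

    In legal utf-8 every character contributes exactly one such byte
    (its lead byte); all continuation bytes lie inside 80-BF.
    """
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)


def check_invariants(text: str) -> bool:
    """Check the byte-count properties on the utf-8 encoding of text."""
    data = text.encode("utf-8")
    n = len(text)  # number of code points in Python 3
    return count_chars(data) == n and len(data) <= 4 * n


def is_legal_utf8(data: bytes) -> bool:
    """Python's strict decoder also rejects non-minimal (overlong) forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One character each of encoded length 1, 2, 3 and 4 bytes.
sample = "a\u00e9\u20ac\U0001F600"
# C0 80 is an overlong (non-minimal) encoding of U+0000.
overlong_nul = bytes([0xC0, 0x80])
```

For sample the encoded length is 1 + 2 + 3 + 4 = 10 bytes for 4
characters, and overlong_nul is refused by is_legal_utf8.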

I'd tend to move the special "output in byte-sized chunks" characters
to "11xxxx: after all, fonts may contain stuff in the "10ffxx area,
and overfull box messages will output those characters.

Those can be represented internally by (basically out of range) utf-8
sequences in the obvious way.  If the input reader for utf-8 cranks
out the corresponding "output in byte-sized chunks" characters for
illegal utf-8 byte sequences, then inputting them accidentally will
usually lead to "missing character" errors, but it will be possible to
write stuff like
\message{^^11xxxx}
to produce verbatim output, and
\message{illegal byte sequence}
will reproduce the illegal byte sequence unchanged, without having any
illegal byte sequence (apart from the codes for "11xxxx) present
in the innards of LuaTeX.
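
A minimal sketch of that reader/writer round trip, in Python for
illustration only.  It assumes one escape code per raw byte, placed at
"1100xx (the choice between "11xxxx and "1100xx is left open in this
posting); BYTE_BASE, read_utf8 and write_bytes are hypothetical names.

```python
from typing import List

# Assumption: raw byte b is represented internally as code 0x110000 + b,
# which lies just above the last Unicode code point 0x10FFFF.
BYTE_BASE = 0x110000


def read_utf8(data: bytes) -> List[int]:
    """Turn bytes into character codes.

    Legal utf-8 sequences become ordinary code points; every byte that
    is not part of a legal sequence becomes BYTE_BASE + byte.
    """
    codes = []
    i = 0
    while i < len(data):
        # Try the shortest candidate first, so the first chunk that
        # decodes is always exactly one character.
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            codes.append(ord(ch))
            i += length
            break
        else:
            # No legal sequence starts here: escape this single byte.
            codes.append(BYTE_BASE + data[i])
            i += 1
    return codes


def write_bytes(codes: List[int]) -> bytes:
    """Inverse of read_utf8: escape codes are written as raw bytes."""
    out = bytearray()
    for cp in codes:
        if cp >= BYTE_BASE:
            out.append(cp - BYTE_BASE)
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)


# 'a', 'b', a stray FF byte, a legal C3 A9, then the surrogate pattern.
data = b"ab\xff\xc3\xa9\xed\xa0\x80"
codes = read_utf8(data)
```

With this scheme every code below BYTE_BASE is a legal character, and
write_bytes(read_utf8(data)) gives back the input bytes unchanged.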

I'd think it reasonable not to permit those characters "11xxxx into
the normal character arrays (lccode, uccode, chardef ...) in a manner
similar to how the codes from "80 to "ff were treated in TeX-2.x
(which had 7-bit arrays inside, but accepted 256 characters in fonts
and input).

When I write "11xxxx instead of "1100xx it is because I don't yet have
a clear idea about whether or how one would bother thinking about
transparent word output when using UTF-16 (which has surrogate
characters and stuff).  Possibly one should just completely forget
about facilitating UTF-16 output, whether through callbacks or
otherwise.

Ok, this is just a sketch (I find disturbing the prospect of having
to code without being able to rely internally on legal utf-8
sequences, at least as long as any callbacks involved are bug-free),
but the main question of this posting was how surrogate code points
are treated in fonts.

Thanks,
David

-- 
David Kastrup