[Dev-luatex] valid Unicode character treatment?

Taco Hoekwater taco at elvenkind.com
Tue Apr 24 15:42:55 CEST 2007


David Kastrup wrote:
> Hi, I've been wondering about several things with regard to
> Unicode/utf-8.
> What is the situation in fonts (of any type) in general concerning the
> non-existing Unicode codepoints corresponding to the 16-bit surrogate
> codes D800-DFFF?  Are there any fonts actually putting stuff there, if
> only ligatures?

I don't think I have ever seen one, but there could be. After all,
nothing in a 16-bit encoded font forces it to use Unicode, just as
an 8-bit encoding does not enforce ASCII. In any case, the overfull
box messages should not output UTF-8 sequences, because what they
report are glyphs, not characters, and that has been the case ever
since 7-bit TeX82.

I intend to switch to number representation for all glyphs that do
not have a Unicode code point assigned, and that should fix the
whole issue finally.
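
A rough sketch of what such a fallback could look like, in Python
rather than in LuaTeX's own source. The function name `show_glyph` and
the `<XXXX>` notation are made up for illustration; the actual log
format is not decided here. "Assigned" is approximated by the Unicode
general category: "Cn" (unassigned) and "Cs" (surrogate) get the
numeric form, as does anything beyond 0x10FFFF.

```python
import unicodedata

UNICODE_MAX = 0x10FFFF

def show_glyph(code):
    """Return the text to print for a glyph slot in a log message."""
    if 0 <= code <= UNICODE_MAX:
        char = chr(code)
        # "Cn" = unassigned (includes noncharacters), "Cs" = surrogate
        if unicodedata.category(char) not in ("Cn", "Cs"):
            return char
    # Hypothetical numeric notation for everything else
    return "<%04X>" % code
```

Note that `unicodedata` reflects whatever Unicode version ships with
the Python in use, so a recently assigned code point may still be
printed as a number.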

> A compliant utf-8 file is not supposed to contain any codes in that
> area, so we would not want to have them appear as part of "overfull
> hbox" messages and similar if I am not mistaken.

But should we really bother testing against that?  Who will care,
except people that want theoretical perfection? I certainly don't.
As long as a UTF-8 sequence can be transformed to an integer that
fits the acceptable range, that is good enough.
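
As a minimal sketch of that "good enough" check (Python, not LuaTeX's
actual code; the name `decode_lenient` and the 0x10FFFF upper bound
are assumptions for illustration), the following turns one UTF-8
sequence into an integer and only verifies that the result fits the
range. It deliberately lets surrogates and overlong forms through.

```python
MAX_CODE = 0x10FFFF  # assumed upper bound of the acceptable range

def decode_lenient(data, pos=0):
    """Decode one UTF-8 sequence starting at data[pos].

    Returns (code, next_pos). Only the lead byte, the continuation
    bytes and the final range are checked; surrogates (D800-DFFF)
    and overlong encodings are accepted on purpose.
    """
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    if 0xC0 <= first < 0xE0:
        length, code = 2, first & 0x1F
    elif 0xE0 <= first < 0xF0:
        length, code = 3, first & 0x0F
    elif 0xF0 <= first < 0xF8:
        length, code = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte 0x%02X" % first)
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for byte in data[pos + 1:pos + length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte 0x%02X" % byte)
        code = (code << 6) | (byte & 0x3F)
    if code > MAX_CODE:
        raise ValueError("value 0x%X is out of range" % code)
    return code, pos + length
```

With this, the three bytes ED A0 80 decode to 0xD800 without
complaint, while F4 90 80 80 (which would be 0x110000) is refused.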

> I'd tend to move the special "output in byte-sized chunks" characters
> to "11xxxx: after all, fonts may contain stuff in the "10ffxx area,

Yes, you are right, they could. Switching to something that is
completely out-of-range is not such a bad idea.
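
To make the idea concrete, here is a sketch in Python that assumes
the byte-sized chunks start at 0x110000, the first value past the last
Unicode code point 0x10FFFF. The base value and the helper names are
assumptions for illustration, not a description of what LuaTeX does
at the moment.

```python
UNICODE_MAX = 0x10FFFF      # last code point a font slot could mirror
RAW_BYTE_BASE = 0x110000    # assumed base for "output as a single byte"

def raw_byte_code(byte):
    """Map a byte (0-255) to its special out-of-range character code."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte: %r" % (byte,))
    return RAW_BYTE_BASE + byte

def is_raw_byte_code(code):
    """True if code is one of the special byte-sized-chunk codes."""
    return RAW_BYTE_BASE <= code <= RAW_BYTE_BASE + 0xFF

def output_bytes(code):
    """Bytes to write for one code: a raw byte or a UTF-8 sequence."""
    if is_raw_byte_code(code):
        return bytes([code - RAW_BYTE_BASE])
    # "surrogatepass" keeps D800-DFFF from raising, matching the
    # lenient attitude towards surrogates discussed in this thread.
    return chr(code).encode("utf-8", "surrogatepass")
```

Because the special codes sit above UNICODE_MAX, no font slot in the
Unicode range can collide with them.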

> Ok, this is just a sketch (I consider the prospect disturbing of
> having to code without being able to rely internally on legal utf-8
> sequences as long as possibly involved callbacks are bugfree), but the

Junk in, junk out. LuaTeX is not a file format validator, but a
typesetting engine. That is what I think, anyway.

> main question of this posting was how surrogate code points are
> treated in fonts.

The input has characters, and these are transformed into glyphs. The
input should adhere to UTF-8 conventions, but the font encoding doesn't
have to. Even if this is not all true *right now*, it will be so before
the official release. Is that clear enough?

Best, Taco
