Hi, David Kastrup wrote:
Hi, I've been wondering about several things with regard to Unicode/utf-8.
What is the situation in fonts (of any type) in general concerning the non-existing Unicode codepoints corresponding to the 16-bit surrogate codes D800-DFFF? Are there any fonts actually putting stuff there, if only ligatures?
I don't think I have ever seen one, but there could be. After all, nothing in a 16-bit encoded font forces it to use Unicode, just as an 8-bit encoding does not enforce ASCII. In any case, the overfull messages should not output UTF-8 sequences, because what they report are glyphs, not characters, and that has been the case ever since 7-bit TeX82. I intend to switch to a numeric representation for all glyphs that do not have a Unicode code point assigned, and that should finally fix the whole issue.
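To make the "number representation" idea concrete, here is a rough sketch in Python. It is not LuaTeX code: the function name `show_slot`, the TeX-style `"XXXX` hex output, and the rule of treating every Unicode "Other" category (control, format, surrogate, private use, unassigned) as "no printable character" are all assumptions made for illustration.

```python
import unicodedata

UNICODE_MAX = 0x10FFFF

def show_slot(code: int) -> str:
    """Return a printable form of a font slot for a log message.

    Hypothetical rule: print the character itself only when the slot
    number is an assigned, printable Unicode code point; otherwise
    print the slot as a TeX-style hexadecimal number.
    """
    if 0 <= code <= UNICODE_MAX:
        # Categories starting with "C" are Cc, Cf, Cs, Co and Cn:
        # control, format, surrogate, private use, unassigned.
        if not unicodedata.category(chr(code)).startswith("C"):
            return chr(code)
    return '"' + format(code, "X")

if __name__ == "__main__":
    for slot in (0x41, 0xD800, 0x110041):
        print(show_slot(slot))
```

With a rule like this, a glyph sitting in a surrogate slot never reaches the log as an illegal UTF-8 sequence, whatever the font has put there.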
A compliant utf-8 file is not supposed to contain any codes in that area, so we would not want to have them appear as part of "overfull hbox" messages and similar if I am not mistaken.
But should we really bother testing against that? Who will care, except people that want theoretical perfection? I certainly don't. As long as a UTF-8 sequence can be transformed to an integer that fits the acceptable range, that is good enough.
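As an illustration of what "good enough" means here, the following Python sketch turns one UTF-8-style byte sequence into an integer and only checks that the result fits the range. The names `decode_lenient` and `is_surrogate` are made up for this example, and overlong encodings are deliberately not rejected; it is a sketch of the lenient approach, not the actual reader in LuaTeX.

```python
MAX_CODE = 0x10FFFF  # assumed upper bound of the acceptable range

def decode_lenient(seq: bytes) -> int:
    """Decode a single UTF-8-style sequence to an integer.

    Only the byte structure and the final range are checked.
    Surrogates (and overlong forms) are accepted on purpose.
    """
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead < 0xE0:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    if value > MAX_CODE:
        raise ValueError("value out of range")
    return value

def is_surrogate(code: int) -> bool:
    """The extra test a strict validator would add."""
    return 0xD800 <= code <= 0xDFFF

if __name__ == "__main__":
    code = decode_lenient(b"\xed\xa0\x80")
    print(hex(code), is_surrogate(code))
```

A strict decoder such as Python's own `bytes.decode("utf-8")` rejects the three bytes `ED A0 80`, while the lenient version simply yields the integer D800 and leaves it to the caller to care or not.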
I'd tend to move the special "output in byte-sized chunks" characters to "11xxxx: after all, fonts may contain stuff in the "10ffxx area,
Yes, you are right, they could. Switching to something that is completely out-of-range is not such a bad idea.
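A small Python sketch of the out-of-range idea, assuming the raw bytes are placed in one block of 256 values starting at "110000. The base value and both function names are assumptions for illustration; the actual layout is not settled in this thread.

```python
UNICODE_MAX = 0x10FFFF
RAW_BASE = 0x110000  # hypothetical: first value above the Unicode range

def raw_byte_to_code(byte: int) -> int:
    """Map a byte to its internal 'output as a raw byte' value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte")
    return RAW_BASE + byte

def code_to_raw_byte(code: int):
    """Return the byte for an internal raw-byte value, else None."""
    if RAW_BASE <= code <= RAW_BASE + 0xFF:
        return code - RAW_BASE
    return None

if __name__ == "__main__":
    internal = raw_byte_to_code(0xA0)
    print(hex(internal), internal > UNICODE_MAX)
    # A font slot in the "10FFxx area is no longer mistaken for a raw byte.
    print(code_to_raw_byte(0x10FFA0))
```

Because every such value lies above "10FFFF, it cannot collide with anything a font legitimately places inside the Unicode range.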
Ok, this is just a sketch (I find it disturbing to have to code without being able to rely internally on legal utf-8 sequences, as long as any callbacks involved are bug-free), but the
Junk in, junk out. LuaTeX is not a file format validator, but a typesetting engine. That is what I think, anyway.
main question of this posting was how surrogate code points are treated in fonts.
The input has characters, and these are transformed into glyphs. The input should adhere to UTF-8 conventions, but the font encoding doesn't have to. Even if this is not all true *right now*, it will be so before the official release. Is that clear enough? Best, Taco