Hi,

I've been wondering about several things with regard to Unicode/utf-8.

What is the situation in fonts (of any type) in general concerning the non-existent Unicode code points corresponding to the 16-bit surrogate codes D800-DFFF? Are there any fonts actually putting stuff there, if only ligatures? A compliant utf-8 file is not supposed to contain any codes in that area, so, if I am not mistaken, we would not want them to appear as part of "overfull hbox" messages and the like.

I am currently thinking about a utf-8 strategy that would be least prone to causing internal inconsistencies. Basically, I think that certain properties of _legal_ utf-8 should be guaranteed inside of LuaTeX: that the number of characters in a string equals the number of bytes outside of the 80-BF code range, that characters are encoded with minimal length, that the number of bytes never exceeds 4 times the number of characters, and similar things.

I'd tend to move the special "output in byte-sized chunks" characters to "11xxxx: after all, fonts may contain stuff in the "10ffxx area, and overfull box messages will output those characters. The "11xxxx characters can be represented internally by (basically out-of-range) utf-8 sequences in the obvious way.

If the input reader for utf-8 cranks out the corresponding "output in byte-sized chunks" characters for illegal utf-8 byte sequences, then inputting them accidentally will usually lead to "missing character" errors, but it will be possible to write stuff like \message{^^11xxxx} to produce verbatim output, and \message{illegal byte sequence} will reproduce the illegal byte sequence unchanged, without having any illegal byte sequence (apart from the codes for "11xxxx) present in the innards of LuaTeX.

I'd think it reasonable not to permit those "11xxxx characters into the normal character arrays (lccode, uccode, chardef ...)
in a manner similar to how the codes from "80 to "ff were treated in TeX-2.x (which had 7-bit arrays inside, but accepted 256 characters in fonts and input).

When I write "11xxxx instead of "1100xx, it is because I don't yet have a clear idea about whether or how one would bother with transparent word output when using UTF-16 (which has surrogates and stuff). Possibly one should just completely forget about facilitating UTF-16 output, whether through callbacks or otherwise.

Ok, this is just a sketch (I find disturbing the prospect of having to write code that can rely on legal utf-8 sequences internally only as long as any callbacks involved are bug-free), but the main question of this posting was how surrogate code points are treated in fonts.

Thanks,
David

-- 
David Kastrup
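
To make the invariants and the "11xxxx idea above a bit more concrete, here is a minimal sketch in Python (not LuaTeX code). Everything named in it is an assumption made purely for illustration: the base value BYTE_BASE = "110000, the helper names, and the choice to lean on Python's "surrogateescape" error handler for spotting illegal bytes. It maps every illegal input byte to the code BYTE_BASE + byte, stores codes in the generalized 4-byte utf-8 pattern, and writes the raw bytes back out unchanged.

```python
# Hypothetical sketch: carry illegal input bytes as codes "1100xx.
BYTE_BASE = 0x110000  # assumption: raw byte b is represented as BYTE_BASE + b


def count_chars(data):
    """Count characters as the bytes outside the continuation range 80-BF."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)


def decode_lenient(data):
    """Decode bytes to a list of codes; illegal bytes become BYTE_BASE + byte."""
    # Python's strict utf-8 decoder rejects overlong forms, surrogates and
    # codes above 10FFFF; "surrogateescape" turns each rejected byte b
    # (always >= 0x80) into the lone surrogate U+DC00 + b.
    text = data.decode("utf-8", errors="surrogateescape")
    codes = []
    for ch in text:
        c = ord(ch)
        if 0xDC80 <= c <= 0xDCFF:
            codes.append(BYTE_BASE + (c - 0xDC00))
        else:
            codes.append(c)
    return codes


def encode_internal(code):
    """Minimal-length utf-8 pattern, extended past 10FFFF for "11xxxx codes."""
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    return bytes([0xF0 | (code >> 18),
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def encode_output(codes):
    """Write codes back out; "1100xx codes are emitted as the single raw byte."""
    out = bytearray()
    for c in codes:
        if BYTE_BASE <= c <= BYTE_BASE + 0xFF:
            out.append(c - BYTE_BASE)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


# Example: "a", e-acute, a stray FF, and an overlong NUL (C0 80), then "z".
raw = b"a\xc3\xa9\xff\xc0\x80z"
codes = decode_lenient(raw)
internal = b"".join(encode_internal(c) for c in codes)

# The invariants hold on the internal form, and output is byte-identical.
assert count_chars(internal) == len(codes)
assert len(internal) <= 4 * len(codes)
assert encode_output(codes) == raw
```

The point of the sketch is only that the internal byte string keeps the counting and length properties of legal utf-8 (apart from the lead bytes needed for "11xxxx), while illegal input still round-trips unchanged.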