Re: [NTG-context] Unicode question

12 Mar 2015


      ...
The luatex code contains the lines (in unistring.w)
    if (val == 0xFFFD)
        utf_error();
    return (val);
in a function str2uni. I didn't really try to understand the code,
but it looks as if 0xFFFD is used as an "invalid marker":
Interesting. This is not actually correct: U+FFFD is a valid Unicode character. It would be better to use U+FFFE or U+FFFF for that, as both are noncharacters.

Note that U+FFFD (REPLACEMENT CHARACTER) is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding. Its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.
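To make the point concrete, here is a minimal sketch of a decoder that reports errors with a noncharacter instead of U+FFFD. It is not the LuaTeX code; the names decode_utf8 and DECODE_ERROR are made up for illustration, and choosing U+FFFF as the sentinel is an assumption (it also means a literal U+FFFF in the input is reported as an error).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: U+FFFF is a noncharacter, so it should not
   occur in well-formed interchange text. */
#define DECODE_ERROR 0xFFFFu

/* Decode one UTF-8 sequence from s, with n bytes available.
   Returns the code point, or DECODE_ERROR for malformed input.
   On success, *len receives the number of bytes consumed. */
static uint32_t decode_utf8(const unsigned char *s, size_t n, size_t *len)
{
    uint32_t val, min;
    size_t need, i;

    if (n == 0)
        return DECODE_ERROR;
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; val = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; val = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; val = s[0] & 0x07; min = 0x10000;
    } else {
        /* stray continuation byte, or a 5- or 6-byte lead byte */
        return DECODE_ERROR;
    }
    if (n < need)
        return DECODE_ERROR;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return DECODE_ERROR;
        val = (val << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and values above U+10FFFF */
    if (val < min || (val >= 0xD800 && val <= 0xDFFF) || val > 0x10FFFF)
        return DECODE_ERROR;
    *len = need;
    return val;
}
```

With this arrangement the bytes EF BF BD decode to 0xFFFD like any other character, and only genuinely malformed input yields the sentinel.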
...
The comment in the code says
/* the 5- and 6-byte UTF-8 sequences generate integers
   that are outside of the valid UCS range, and therefore
   unsupported */
That's correct: the longest valid UTF-8 sequence is 4 bytes, since Unicode code points end at U+10FFFF.
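As an illustration of that limit, the following sketch maps a lead byte to the length of the sequence it introduces. The function name utf8_seq_len is hypothetical and this is not how LuaTeX does it; it only shows that nothing above F4 can start a valid sequence, which rules out the old 5- and 6-byte forms (lead bytes F8 to FD).

```c
/* Number of bytes in the UTF-8 sequence introduced by lead,
   or 0 if lead cannot start a valid sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;  /* 0xxxxxxx: ASCII */
    if (lead < 0xC2) return 0;  /* continuation byte, or overlong lead C0/C1 */
    if (lead < 0xE0) return 2;  /* 110xxxxx */
    if (lead < 0xF0) return 3;  /* 1110xxxx */
    if (lead < 0xF5) return 4;  /* 11110xxx, F0..F4 only */
    return 0;                   /* F5..FF: would exceed U+10FFFF */
}
```

A 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits, which is already more than U+10FFFF needs, so a full decoder still has to check the decoded value against that upper bound.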

Best,

Arthur

Arthur Reutenauer