Hi all,
If I run this minimal example
\starttext
�
\stopluacode
\stoptext
I get
tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence
and some more lines.
The character above is:
Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd
which is a valid utf8 character.
Questions:
1. Why is it considered to be invalid?
2. Are there other valid utf8 characters which are considered invalid?
Just wanting to understand.
-- Manfred
On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
Hi all, If I run this minimal example
\starttext
�
\stopluacode
\stoptext
I get
tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence
and some more lines.
The character above is:
Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd
which is a valid utf8 character.
Questions:
1. Why is it considered to be invalid?
This is not a context question/problem but related to the binary (you would get the same error with lualatex or plain).
The luatex code contains the lines (in unistring.w)
if (val == 0xFFFD) utf_error(); return (val);
in a function str2uni. I didn't really try to understand the code, but it looks as if 0xFFFD is used as an "invalid marker": if luatex encounters something that isn't valid utf8, it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.
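The effect of such an in-band "invalid marker" can be reproduced in a few lines. The following is only a Python sketch of the pattern described above, not the LuaTeX source: `decode_single`, `str2uni_sketch` and `Utf8Error` are hypothetical names, and Python's strict UTF-8 decoder stands in for the real decoding logic.

```python
REPLACEMENT = 0xFFFD  # used here both as a real character and as the failure marker


class Utf8Error(Exception):
    """Hypothetical stand-in for the error raised by the input handler."""


def decode_single(data: bytes) -> int:
    """Decode exactly one code point; map anything malformed to REPLACEMENT."""
    try:
        text = data.decode("utf-8")  # strict decoding, rejects malformed input
    except UnicodeDecodeError:
        return REPLACEMENT
    if len(text) != 1:
        return REPLACEMENT
    return ord(text)


def str2uni_sketch(data: bytes) -> int:
    """Assumed shape of the check: test the decoded value against the marker."""
    val = decode_single(data)
    if val == REPLACEMENT:
        # At this point "decoding failed" and "the input really was U+FFFD"
        # are indistinguishable, so both end up as an error.
        raise Utf8Error("String contains an invalid utf-8 sequence")
    return val
```

With this sketch, `str2uni_sketch(b"\xff")` raises as expected, but so does `str2uni_sketch(b"\xef\xbf\xbd")`, although those three bytes are the well-formed encoding of U+FFFD.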
2. Are there other valid utf8 characters which are considered invalid?
The comment in the code says
/* the 5- and 6-byte UTF-8 sequences generate integers
that are outside of the valid UCS range, and therefore
unsupported */
-- Ulrike Fischer http://www.troubleshooting-tex.de/
The luatex code contains the lines (in unistring.w)
if (val == 0xFFFD) utf_error(); return (val);
in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker":
Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+FFFF for that. Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.
The comment in the code says
/* the 5- and 6-byte UTF-8 sequences generate integers
that are outside of the valid UCS range, and therefore
unsupported */
That's correct, the longest valid UTF-8 sequence is 4 bytes. Best, Arthur
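Both points can be checked with a small Python sketch. The helper names `utf8_length` and `is_noncharacter` are made up for this illustration. The ranges used are the standard ones: code points up to U+10FFFF need at most 4 bytes, and U+FFFE and U+FFFF belong to the set of noncharacters, while U+FFFD does not.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 needs for a Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4  # the maximum; 5- and 6-byte forms would exceed U+10FFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

Here `utf8_length(0xFFFD)` is 3, matching the bytes 0xefbfbd from the original report, and `is_noncharacter` is true for 0xFFFE and 0xFFFF but false for 0xFFFD.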
Hi Arthur,
On Thu, 12 Mar 2015 16:35:47 +0000, Arthur Reutenauer wrote:
The luatex code contains the lines (in unistring.w)
if (val == 0xFFFD) utf_error(); return (val);
in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker":
Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+FFFF for that.
Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.
Yes, I'm aware of that. So I also think that it isn't correct to use U+FFFD for this. Your suggestion of using either U+FFFE or U+FFFF sounds good as both are really invalid. -- Best, Manfred
On 3/12/2015 7:08 PM, Manfred Lotz wrote:
it's an attempt to recover, but in the process a normal 0xFFFD triggers an error too; recovering to 0xFFFD for a really invalid input is ok, as tex does that in more cases: "i expected a } so i insert one here ... cross your fingers", etc.
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On Thu, 12 Mar 2015 16:41:59 +0100, Ulrike Fischer wrote:
On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:
Hi all, If I run this minimal example
\starttext
�
\stopluacode
\stoptext
I get
tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence
and some more lines.
The character above is:
Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd
which is a valid utf8 character.
Questions:
1. Why is it considered to be invalid?
This is not a context question/problem but related to the binary (you would get the same error with lualatex or plain)
Yes, I know.
The luatex code contains the lines (in unistring.w)
if (val == 0xFFFD) utf_error(); return (val);
in a function str2uni. I didn't really try to understand the code, but it looks as if 0xFFFD is used as an "invalid marker": if luatex encounters something that isn't valid utf8, it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.
Took me a while to find the repository but finally I got it.
2. Are there other valid utf8 characters which are considered invalid?
The comment in the code says
/* the 5- and 6-byte UTF-8 sequences generate integers
that are outside of the valid UCS range, and therefore
unsupported */
Well, it is called REPLACEMENT CHARACTER, and it seems that this character is used to replace invalid characters. Then the check
if (val == 0xFFFD) utf_error();
causes the error message
tex_error("String contains an invalid utf-8 sequence", hlp);
to be displayed.
Ok, this answers my question.
Thanks for the pointer.
-- Manfred
On 3/12/2015 6:57 PM, Manfred Lotz wrote:
it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input
Hans
On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:
it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input
FFFD REPLACEMENT CHARACTER • used to replace an incoming character whose value is unknown or unrepresentable in Unicode
The meaning of FFFD is not "typeset a question mark on a black box" as in � (which depends on the font in any case, so in principle it's possible to see something completely different in a new version of the font) but to signal something potentially wrong, with a symbol that currently in most cases is �. Misusing the meaning is not bad per se, but in this specific case I think luatex is correct to be conservative and ask the user what to do; context --batchmode typesets the document, writes the messages to the log, and ends with -1, so an automatic agent is also alerted.
-- luigi
On 3/12/2015 9:41 PM, luigi scarso wrote:
On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:
it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input
FFFD REPLACEMENT CHARACTER • used to replace an incoming character whose value is unknown or unrepresentable in Unicode
the question is not what to do when an invalid character comes in: in that case luatex can replace it by 0xFFFD and issue an error as now. but when the input itself has a 0xFFFD, then luatex should just carry on, as 0xFFFD is a *valid* character.
it is quite easy for a macro package to trigger an error, as \catcode"FFFD=15 will do that, but it's impossible for a macro package to intercept the weird interception by luatex's input handler
The meaning of FFFD is not "typeset a question mark on a black box" as in � (which depends on the font in any case, so in principle it's possible to see something completely different in a new version of the font) but to signal something potentially wrong, with a symbol that currently in most cases is �. Misusing the meaning is not bad per se, but in this specific case I think luatex is correct to be conservative and ask the user what to do; context --batchmode typesets the document, writes the messages to the log, and ends with -1, so an automatic agent is also alerted.
you cannot force a user to use \batchmode, and -1 would abort a wrapper, thereby leading to an invalid document; it means that luatex can never typeset a document where char 0xFFFD is being typeset, and luatex should not be normative.
not accepting 0xFFFD in the input is a bug
Hans
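One way to get the behaviour asked for here is to report the decoding failure out of band instead of testing the decoded value. The following Python sketch only illustrates that idea and is not a patch for LuaTeX; `decode_with_flag` is a hypothetical name, and Python's strict UTF-8 decoder again stands in for the real one.

```python
from typing import Tuple

REPLACEMENT = 0xFFFD


def decode_with_flag(data: bytes) -> Tuple[int, bool]:
    """Decode one code point and return (value, ok).

    Malformed input still recovers to U+FFFD, but the failure travels in the
    separate ok flag, so a genuine U+FFFD in the input is not an error.
    """
    try:
        text = data.decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return REPLACEMENT, False
    if len(text) != 1:
        return REPLACEMENT, False
    return ord(text), True
```

A caller would raise the "invalid utf-8 sequence" error only when the flag is false: `decode_with_flag(b"\xff")` gives `(0xFFFD, False)`, while `decode_with_flag(b"\xef\xbf\xbd")` gives `(0xFFFD, True)` and can be passed on like any other character.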
participants (5)
- Arthur Reutenauer
- Hans Hagen
- luigi scarso
- Manfred Lotz
- Ulrike Fischer