Unicode question

Manfred Lotz

12 Mar 2015

1:48 a.m.

Hi all,

If I run this minimal example

\starttext

�

\stopluacode

\stoptext

I get

tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence

and some more lines.

The character above is:

Character: �
Character name: REPLACEMENT CHARACTER
Charblock: Specials
Category: Other symbol
Unicode: U+fffd
UTF8: 0xefbfbd

which is a valid utf8 character.

Questions:

1. Why is it considered to be invalid?
2. Are there other valid utf8 characters which are considered invalid?

Just wanting to understand.

--
Manfred
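
As a cross-check of the "UTF8: 0xefbfbd" value quoted in the question, here is a minimal sketch of the three-byte UTF-8 encoding rule. The helper name encode3 is hypothetical and only for illustration; it assumes a code point in the range U+0800..U+FFFF and does not check for surrogates.

```c
#include <stdint.h>

/* Hypothetical helper: encode a code point in U+0800..U+FFFF as three
   UTF-8 bytes. Surrogates (U+D800..U+DFFF) are not rejected here. */
static void encode3(uint32_t cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
}
```

Calling encode3(0xFFFD, buf) fills buf with the bytes ef bf bd, which is a well-formed three-byte sequence.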


Ulrike Fischer

12 Mar 2015

9:41 a.m.

On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:

...

Hi all, If I run this minimal example

\starttext

�

\stopluacode

\stoptext

I get

tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence

and some more lines.

The character above is:

Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd

which is a valid utf8 character.

Questions:

1. Why is it considered to be invalid?

This is not a ConTeXt question/problem but related to the binary (you would get the same error with lualatex or plain).

The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error();
return (val);

in a function str2uni. I didn't really try to understand the code, but it looks as if 0xFFFD is used as an "invalid marker": if luatex encounters something that isn't valid UTF-8, it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.
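
The effect of such an in-band marker can be sketched as follows. This is not the luatex source: decode_inband and would_report_error are hypothetical names, and the decoder only handles 1-byte and 3-byte sequences (without overlong or surrogate checks) to keep the illustration short.

```c
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0xFFFDu  /* in-band "invalid" marker, as described above */

/* Hypothetical decoder, NOT the luatex code: returns the decoded code
   point, or SENTINEL for anything it cannot decode. */
static uint32_t decode_inband(const unsigned char *s, size_t n)
{
    if (n >= 1 && s[0] < 0x80)
        return s[0];                                   /* ASCII */
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        return ((uint32_t)(s[0] & 0x0F) << 12) |       /* 3-byte form */
               ((uint32_t)(s[1] & 0x3F) << 6) |
               (uint32_t)(s[2] & 0x3F);
    }
    return SENTINEL;                                   /* malformed input */
}

/* Returns 1 if a caller that tests against the sentinel would report an
   error. It cannot tell a real U+FFFD from malformed input. */
static int would_report_error(const unsigned char *s, size_t n)
{
    return decode_inband(s, n) == SENTINEL;
}
```

With this scheme the well-formed bytes ef bf bd and a stray byte such as ff both end up on the error path, which matches the behaviour reported in the question.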

...

2. Are there other valid utf8 characters which are considered invalid?

The comment in the code says

/* the 5- and 6-byte UTF-8 sequences generate integers that are outside of the valid UCS range, and therefore unsupported */

--
Ulrike Fischer
http://www.troubleshooting-tex.de/

Arthur Reutenauer

10:35 a.m.

...

The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error(); return (val);

in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker":

Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+FFFF for that. Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.

...

The comment in the code says

/* the 5- and 6-byte UTF-8 sequences generate integers

that are outside of the valid UCS range, and therefore

unsupported */

That's correct, the longest valid UTF-8 sequence is 4 bytes.

Best,
Arthur
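
The 4-byte limit can be read off the lead byte alone. A minimal sketch, assuming the current UTF-8 definition (code points up to U+10FFFF); the function name utf8_seq_len is hypothetical:

```c
/* Hypothetical helper: length of the UTF-8 sequence introduced by lead
   byte b, or 0 if b cannot start a well-formed sequence. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                 /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                 /* 110xxxxx, minus overlong C0/C1 */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                 /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                 /* 11110xxx, up to U+10FFFF */
    return 0;                     /* continuation bytes, C0, C1, F5..FF */
}
```

The lead bytes f8..fd, which introduced 5- and 6-byte sequences in the older definition, fall into the last case and are rejected.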

Manfred Lotz

12:08 p.m.

Hi Arthur, On Thu, 12 Mar 2015 16:35:47 +0000 Arthur Reutenauer wrote:

...

...
The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error(); return (val);

in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker":

Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+FFFF for that.

Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.

Yes, I'm aware of that. So I also think that it isn't correct to use U+FFFD for this. Your suggestion of using either U+FFFE or U+FFFF sounds good as both are really invalid.

--
Best,
Manfred
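
For reference, Unicode designates U+FFFE and U+FFFF as noncharacters, while U+FFFD is an ordinary assigned character. A small sketch of that distinction, with the hypothetical helper name is_noncharacter:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp is a Unicode noncharacter, i.e.
   U+FDD0..U+FDEF or any code point whose low 16 bits are FFFE or FFFF. */
static int is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;                       /* not a code point at all */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    return (cp & 0xFFFE) == 0xFFFE;     /* xxFFFE or xxFFFF */
}
```

is_noncharacter returns 1 for 0xFFFE and 0xFFFF and 0 for 0xFFFD, so a marker taken from the first two would not collide with text a document may legitimately contain.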

Hans Hagen

1:04 p.m.

On 3/12/2015 7:08 PM, Manfred Lotz wrote:

...

Hi Arthur,

On Thu, 12 Mar 2015 16:35:47 +0000 Arthur Reutenauer wrote:

...
...
The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error(); return (val);

in a function str2uni. I didn't really try to understand the code but it looks as if 0xFFFD is used as "invalid marker":

Interesting. This is not actually correct, U+FFFD is a valid Unicode character; it would be better to use U+FFFE or U+FFFF for that.

Note that U+FFFD is the recommended character to use when a character can't be recognised while converting to Unicode from another encoding, so its presence is usually a sign that something went wrong upstream, but I assume Manfred is aware of that.

Yes, I'm aware of that. So I also think that it isn't correct to use U+FFFD for this. Your suggestion of using either U+FFFE or U+FFFF sounds good as both are really invalid.

it's an attempt to recover, but in the process a normal 0xFFFD triggers an error too; recovering to 0xFFFD for a really invalid input is ok as tex does that in more cases: "i expected a } so i insert one here ... cross your fingers" etc

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Manfred Lotz

11:57 a.m.

On Thu, 12 Mar 2015 16:41:59 +0100 Ulrike Fischer wrote:

...

On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:

...
Hi all, If I run this minimal example

\starttext

�

\stopluacode

\stoptext

I get

tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence

and some more lines.

The character above is:

Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd

which is a valid utf8 character.

Questions:

1. Why is it considered to be invalid?

This is not a context question/problem but related to the binary (you would get the same error with lualatex or plain)

Yes, I know.

...

The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error(); return (val);

in a function str2uni. I didn't really try to understand the code, but it looks as if 0xFFFD is used as an "invalid marker": if luatex encounters something that isn't valid UTF-8, it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.

Took me a while to find the repository but finally I got it.

...

...
2. Are there other valid utf8 characters which are considered invalid?

The comment in the code says

/* the 5- and 6-byte UTF-8 sequences generate integers

that are outside of the valid UCS range, and therefore

unsupported */

Well, it is called REPLACEMENT CHARACTER, and it seems that this character will be used to replace invalid characters. Then the check

if (val == 0xFFFD) utf_error();

causes the error message

tex_error("String contains an invalid utf-8 sequence", hlp);

to be displayed.

Ok, this answers my question.

Thanks for the pointer.

--
Manfred

Hans Hagen

12:55 p.m.

On 3/12/2015 6:57 PM, Manfred Lotz wrote:

...

On Thu, 12 Mar 2015 16:41:59 +0100 Ulrike Fischer wrote:

...
On Thu, 12 Mar 2015 08:48:27 +0100, Manfred Lotz wrote:

...
Hi all, If I run this minimal example

\starttext

�

\stopluacode

\stoptext

I get

tex error > error on line 3 in file /data/tmp/u1.tex: ! String contains an invalid utf-8 sequence

and some more lines.

The character above is:

Character: � Character name: REPLACEMENT CHARACTER Charblock: Specials Category: Other symbol Unicode: U+fffd UTF8: 0xefbfbd

which is a valid utf8 character.

Questions:

1. Why is it considered to be invalid?

This is not a context question/problem but related to the binary (you would get the same error with lualatex or plain)

Yes, I know.

...
The luatex code contains the lines (in unistring.w)

if (val == 0xFFFD) utf_error(); return (val);

in a function str2uni. I didn't really try to understand the code, but it looks as if 0xFFFD is used as an "invalid marker": if luatex encounters something that isn't valid UTF-8, it maps val to 0xFFFD and then tests against 0xFFFD to raise an error.

Took me a while to find the repository but finally I got it.

...
...
2. Are there other valid utf8 characters which are considered invalid?

The comment in the code says

/* the 5- and 6-byte UTF-8 sequences generate integers

that are outside of the valid UCS range, and therefore

unsupported */

Well, it is called REPLACEMENT CHARACTER, and it seems that this character will be used to replace invalid characters. Then it causes if (val == 0xFFFD) utf_error();

the error message tex_error("String contains an invalid utf-8 sequence", hlp);

to be displayed.

Ok, this answers my question.

Thanks for the pointer.

it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input

Hans
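
One way to keep the recovery to 0xFFFD without obscuring a valid 0xFFFD is to report the error out of band. The following is only a sketch of that idea, not the luatex fix: decode_oob is a hypothetical name, and the decoder handles just 1-byte and 3-byte sequences (no overlong or surrogate checks) for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder with a separate status: stores the code point in
   *cp and returns 1 on success. On malformed input it stores U+FFFD as
   the recovery value and returns 0, so the caller can raise the error. */
static int decode_oob(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n >= 1 && s[0] < 0x80) {
        *cp = s[0];                                    /* ASCII */
        return 1;
    }
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |        /* 3-byte form */
              ((uint32_t)(s[1] & 0x3F) << 6) |
              (uint32_t)(s[2] & 0x3F);
        return 1;
    }
    *cp = 0xFFFD;                                      /* recover */
    return 0;                                          /* and flag it */
}
```

Here the bytes ef bf bd decode to 0xFFFD with status 1, while a stray ff also yields 0xFFFD but with status 0, so only the second case needs to trigger the error message.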

luigi scarso

2:41 p.m.

On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen wrote:

...

it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input

FFFD REPLACEMENT CHARACTER • used to replace an incoming character whose value is unknown or unrepresentable in Unicode

The meaning of FFFD is not "typeset a question mark on a black box" as in � (which depends on the font in any case, so in principle it's possible to see something completely different in a new version of the font) but to signal something potentially wrong, with a symbol that currently in most cases is �. Misusing the meaning is not bad per se, but in this specific case I think luatex is correct to be conservative and ask the user what to do; context --batchmode typesets the document, writes the messages to the log, and ends with -1, so an automatic agent is also alerted.

--
luigi

Hans Hagen

2:52 p.m.

On 3/12/2015 9:41 PM, luigi scarso wrote:

...

On Thu, Mar 12, 2015 at 7:55 PM, Hans Hagen <pragma@wxs.nl> wrote:

it's actually a bug ... it is ok to map an invalid character in the input to 0xFFFD, halt and continue when permitted, but the method used in luatex thereby obscures a valid 0xFFFD in the input

FFFD REPLACEMENT CHARACTER • used to replace an incoming character whose value is unknown or unrepresentable in Unicode

the question is not what to do when an invalid character comes in: in that case luatex can replace it by 0xFFFD and issue an error as now. but when the input has an 0xFFFD then luatex should just carry on, as 0xFFFD is a *valid* character

it is quite easy for a macro package to trigger an error, as \catcode"FFFD=15 will do that, but it's impossible for a macro package to intercept the weird interception by luatex's input handler

...

The meaning of FFFD is not "typeset a question mark on a black box" as in � (which depends on the font in any case, so in principle it's possible to see something completely different in a new version of the font) but to signal something potentially wrong, with a symbol that currently in most cases is �. Misusing the meaning is not bad per se, but in this specific case I think luatex is correct to be conservative and ask the user what to do; context --batchmode typesets the document, writes the messages to the log, and ends with -1, so an automatic agent is also alerted.

you cannot force a user to use \batchmode, and -1 would abort a wrapper, thereby leading to an invalid document; it means that luatex can never typeset a document where char 0xFFFD is being typeset, and luatex should not be normative

not accepting 0xFFFD in the input is a bug

Hans
