Incorrect UTF-8 decoding in \luaescapestring (+ fix)
Hello,

the following plain TeX example illustrates the problem:

    %&luatex
    \directlua0{%
      % The following character is DESERET CAPITAL LETTER LONG I,
      % Unicode 0x10400, encoded in UTF-8 as F0 90 90 80:
      local s = '\luaescapestring{𐐀}'% A
      % local s = '𐐀'% B
      for c in string.bytes(s) do
        texio.write_nl(c)
      end
    }%
    \end

(A) results in:

    This is LuaTeX, Version snapshot-0.25.0-2008031419 (Web2C 7.5.6)
    (EscapeUTF8.tex
    ! Pool contains an invalid utf-8 sequence .
    l.4 local s = '\luaescapestring{????}
                                         '% A
    ?
    240
    144
    144
    128
    239
    191
    189
    )
    No pages of output.
    Transcript written on EscapeUTF8.log.

(B) results in:

    This is LuaTeX, Version snapshot-0.25.0-2008031419 (Web2C 7.5.6)
    (EscapeUTF8.tex
    240
    144
    144
    128
    )
    No pages of output.
    Transcript written on EscapeUTF8.log.

It seems that \luaescapestring does not handle long UTF-8 sequences correctly. The additional three bytes above -- 239-191-189 or EF-BF-BD -- encode Unicode character FFFD -- REPLACEMENT CHARACTER -- in UTF-8, the character LuaTeX inserts when encountering an invalid UTF-8 sequence.

I think that the error lies in luatex.web, line 10911:

    @d unicode_incr(#)==if str_pool[#]>@"F0 then #:=#+4
      else if str_pool[#]>@"E0 then #:=#+3
      else if str_pool[#]>@"C0 then #:=#+2
      else incr(#)

Since the lead byte F0 is not greater than F0, the first test fails and the second one matches. So instead of skipping the entire four-byte sequence, only the first three bytes are skipped, and the last byte, hex 80, is left for processing as the next character. Since a lone 0x80 is an invalid UTF-8 sequence, LuaTeX displays the above error message and continues with 0xFFFD.

So I think '>=' should be used instead of '>':

    @d unicode_incr(#)==if str_pool[#]>=@"F0 then #:=#+4
      else if str_pool[#]>=@"E0 then #:=#+3
      else if str_pool[#]>=@"C0 then #:=#+2
      else incr(#)

I tried this modification, and the error disappeared along with the additional three bytes.

Jonathan

P.S.: How does the bug tracker work? I tried to register some weeks ago, but never got the confirmation mail with the password. When trying to register again now, it says the username is already in use, even though non-activated accounts should be purged after a week.
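
For readers who want to check the off-by-one comparison described in the message above without rebuilding LuaTeX, here is a minimal Python sketch of the two variants of unicode_incr. It is not LuaTeX code: the names skip_buggy, skip_fixed and split_chars are hypothetical helpers introduced only for this illustration, and the sketch assumes the macro's sole job is to advance an index by the length implied by the lead byte.

```python
def skip_buggy(lead: int) -> int:
    """Bytes skipped by the original macro, which compares with '>'."""
    if lead > 0xF0:
        return 4
    if lead > 0xE0:
        return 3
    if lead > 0xC0:
        return 2
    return 1


def skip_fixed(lead: int) -> int:
    """Bytes skipped by the proposed macro, which compares with '>='."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def split_chars(data: bytes, skip) -> list:
    """Cut data into chunks, advancing by skip(lead byte) each time."""
    chunks, i = [], 0
    while i < len(data):
        n = skip(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# U+10400 DESERET CAPITAL LETTER LONG I, UTF-8 bytes F0 90 90 80
sample = "\U00010400".encode("utf-8")

print(split_chars(sample, skip_buggy))  # [b'\xf0\x90\x90', b'\x80']
print(split_chars(sample, skip_fixed))  # [b'\xf0\x90\x90\x80']
print("\ufffd".encode("utf-8").hex())   # efbfbd
```

With the '>' comparison the trailing byte 0x80 is split off as a character of its own, which matches the stray continuation byte described in the report. The same comparison would also misjudge a lead byte of exactly E0 (the first byte of U+0800 to U+0FFF), so the '>=' change matters for more than four-byte sequences.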
Jonathan Sauer wrote:
Hello,
the following PlainTeX example illustrates the problem: [...]
So I think '>=' should be used, instead of '>':
    @d unicode_incr(#)==if str_pool[#]>=@"F0 then #:=#+4
      else if str_pool[#]>=@"E0 then #:=#+3
      else if str_pool[#]>=@"C0 then #:=#+2
      else incr(#)
I tried this modification, and the error disappeared along with the additional three bytes.
Thanks, applied!
P.S: How does the bug tracker work? I tried to register some weeks ago, but never got the confirmation mail with the password. When trying to register again now, it says the username is already being used, even though non-activated accounts should be purged after a week.
I am having a problem with my service reseller: the server the tracker runs on doesn't have a reverse hostname yet. That makes your SMTP server (mailer2.silverstroke.com) reject all email from that server. That is a bit harsh really, but probably nothing you can do anything about. I even forwarded you a few bounced messages, but those were probably filtered out on your side as well. Isn't spam fun?

Look out for a private email message from me in a minute.

Best wishes,
Taco
Participants (2): Jonathan Sauer, Taco Hoekwater