% The following code converts a string to UTF-16 big endian with BOM
% and outputs it using \message:

% We change the catcode of '%' so we can use it for modulo calculations:
\begingroup
\catcode`\%=12
\directlua0{\unexpanded{
function convertToUTF16(str)
  local result = string.char(0xFE) .. string.char(0xFF)
  for c in string.utfvalues(str) do
    if c < 0x10000 then
      result = result .. string.char(c / 256) .. string.char(c % 256)
    else
      c = c - 0x10000
      local c1 = c / 1024 + 0xD800
      local c2 = c % 1024 + 0xDC00
      result = result .. string.char(c1 / 256) .. string.char(c1 % 256) ..
               string.char(c2 / 256) .. string.char(c2 % 256)
    end
  end
  tex.print('\\message{' .. result .. '}')
end

convertToUTF16('AäöüB!')
}}
\endgroup

\bye

This fails with 'Text line contains an invalid utf-8 sequence.' (not
surprising, since the text is UTF-16 big endian). If I want to pass the
UTF-16-encoded string e.g. to \pdfoutline (since PDF bookmarks can be
encoded in UTF-16), how do I do this?

(Maybe a callback would be useful, e.g. `convert_pdf_text')

Thanks in advance,
Jonathan
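Note: the surrogate-pair arithmetic in the code above can be checked in
plain Lua, outside LuaTeX. The following is only a sketch and not part of
the original post; the function names surrogates and be16 are made up for
illustration. It uses math.floor for the integer divisions, on the
assumption that it should also run on Lua versions that reject fractional
arguments to string.char.

```lua
-- Split a code point above U+FFFF into a UTF-16 surrogate pair.
function surrogates(c)
  assert(c >= 0x10000 and c <= 0x10FFFF, "not a supplementary code point")
  c = c - 0x10000
  local high = 0xD800 + math.floor(c / 1024) -- upper 10 bits
  local low  = 0xDC00 + c % 1024             -- lower 10 bits
  return high, low
end

-- Serialise one 16-bit unit as two bytes, most significant byte first.
function be16(u)
  return string.char(math.floor(u / 256), u % 256)
end

-- U+1D11E (MUSICAL SYMBOL G CLEF) as a worked example.
local high, low = surrogates(0x1D11E)
print(string.format("%04X %04X", high, low)) -- prints "D834 DD1E"
```

The two units are then written with be16(high) .. be16(low), which gives
the same four bytes as the else branch of convertToUTF16.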
Jonathan Sauer wrote:
> % The following code converts a string to UTF-16 big endian with BOM
> % and outputs it using \message:
>
> % We change the catcode of '%' so we can use it for modulo calculations:
> \begingroup
> \catcode`\%=12
> \directlua0{\unexpanded{
> function convertToUTF16(str)
>   local result = string.char(0xFE) .. string.char(0xFF)
>   for c in string.utfvalues(str) do
>     if c < 0x10000 then
>       result = result .. string.char(c / 256) .. string.char(c % 256)
>     else
>       c = c - 0x10000
>       local c1 = c / 1024 + 0xD800
>       local c2 = c % 1024 + 0xDC00
>       result = result .. string.char(c1 / 256) .. string.char(c1 % 256) ..
>                string.char(c2 / 256) .. string.char(c2 % 256)
>     end
>   end
>   tex.print('\\message{' .. result .. '}')
> end
>
> convertToUTF16('AäöüB!')
> }}
> \endgroup
>
> \bye
>
> This fails with 'Text line contains an invalid utf-8 sequence.' (not
> surprising, since the text is UTF-16 big endian). If I want to pass the
> UTF-16-encoded string e.g. to \pdfoutline (since PDF bookmarks can be
> encoded in UTF-16), how do I do this?
>
> (Maybe a callback would be useful, e.g. `convert_pdf_text')

You can move all 'bytes' to a reserved private area (see the manual);
LuaTeX will then write them with that offset subtracted. Think of a
private 256-slot area representing bytes.

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
Hans Hagen wrote:
>> This fails with 'Text line contains an invalid utf-8 sequence.' (not
>> surprising, since the text is UTF-16 big endian). If I want to pass the
>> UTF-16-encoded string e.g. to \pdfoutline (since PDF bookmarks can be
>> encoded in UTF-16), how do I do this?
>>
>> (Maybe a callback would be useful, e.g. `convert_pdf_text')
>
> You can move all 'bytes' to a reserved private area (see the manual);
> LuaTeX will then write them with that offset subtracted. Think of a
> private 256-slot area representing bytes.

For now, that is what will have to be done. But I agree that is ugly,
and either a callback or primitive or lua pdf.outline() access is
called for.

Best wishes,
Taco
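
Note: the private-area approach Hans describes could look roughly like the
following plain-Lua sketch. It is not code from the thread. The start of
the byte area (BYTE_AREA = 0x110000) is an assumption that must be checked
against the LuaTeX manual for the version in use, and the names utf8encode
and bytesToPrivateArea are made up for illustration. Each byte b is mapped
to the character BYTE_AREA + b and written in the UTF-8 bit layout, so that
the string handed to TeX no longer contains raw 8-bit bytes.

```lua
-- ASSUMPTION: first slot of the reserved 256-slot byte area.
-- Verify this value in the LuaTeX manual before relying on it.
local BYTE_AREA = 0x110000

-- Encode code point c (0 .. 0x1FFFFF) using the UTF-8 bit layout.
function utf8encode(c)
  if c < 0x80 then
    return string.char(c)
  elseif c < 0x800 then
    return string.char(0xC0 + math.floor(c / 64),
                       0x80 + c % 64)
  elseif c < 0x10000 then
    return string.char(0xE0 + math.floor(c / 4096),
                       0x80 + math.floor(c / 64) % 64,
                       0x80 + c % 64)
  else
    return string.char(0xF0 + math.floor(c / 262144),
                       0x80 + math.floor(c / 4096) % 64,
                       0x80 + math.floor(c / 64) % 64,
                       0x80 + c % 64)
  end
end

-- Move every byte of a binary string into the byte area.
function bytesToPrivateArea(bytes)
  return (bytes:gsub(".", function(ch)
    return utf8encode(BYTE_AREA + ch:byte())
  end))
end
```

In convertToUTF16 one would then pass bytesToPrivateArea(result) to
tex.print instead of result itself. Whether a given LuaTeX version accepts
these sequences on input is something to test, not something this sketch
guarantees.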
On Tue, Dec 11, 2007 at 09:07:31AM +0100, Jonathan Sauer wrote:
> % The following code converts a string to UTF-16 big endian with BOM
> % and outputs it using \message:
>
> function convertToUTF16(str)
>   local result = string.char(0xFE) .. string.char(0xFF)
>   for c in string.utfvalues(str) do
>     if c < 0x10000 then
>       result = result .. string.char(c / 256) .. string.char(c % 256)
>     else
>
>   tex.print('\\message{' .. result .. '}')
>
> This fails with 'Text line contains an invalid utf-8 sequence.' (not
> surprising, since the text is UTF-16 big endian). If I want to pass the
> UTF-16-encoded string e.g. to \pdfoutline (since PDF bookmarks can be
> encoded in UTF-16), how do I do this?

In the package "pdftexcmds" I had a similar problem when reimplementing
\pdfunescapehex. There, too, 8-bit bytes are possible in the result.
My solution was:
* Use of a token register (avoids catcode troubles), principle:

    \newtoks\foobartoks
    \directlua0{
      ...
      function convert(str)
        ...
        tex.settoks("foobartoks", <result>)
        ...
      end
    }
    \def\convert#1{%
      \the\directlua0{%
        convert("\luaescapestring{#1}")
      }\foobartoks
    }

  Alternatively tex.print may be used, but it requires a catcode table
  to avoid trouble with unexpected catcode settings. (Currently
  I am writing a package for this purpose.)
* The convert function in Lua first calculates the byte string.
  Then it replaces each byte in the Lua string with its
  multi-byte UTF-8 sequence. LuaTeX then sees proper
  UTF-8 input and converts it back.
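
Note: the second step can be sketched in plain Lua as follows. This is not
the pdftexcmds code; the function name bytesToUTF8 is made up for
illustration, and the sketch assumes that "its multi-byte sequence" means
the UTF-8 encoding of the code point with the same value as the byte
(U+0080 to U+00FF need two bytes, while bytes below 0x80 stay unchanged).

```lua
-- Re-encode each byte b of a binary string as the UTF-8 sequence
-- for the code point with the same numeric value.
function bytesToUTF8(bytes)
  return (bytes:gsub(".", function(ch)
    local b = ch:byte()
    if b < 0x80 then
      return ch -- 7-bit bytes are already valid UTF-8
    end
    -- Two-byte form: 110000xx 10xxxxxx
    return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
  end))
end

-- The UTF-16BE bytes for a BOM followed by U+00E4:
local utf16 = string.char(0xFE, 0xFF, 0x00, 0xE4)
local safe = bytesToUTF8(utf16)
print(#utf16, #safe) -- 4 bytes in, 7 bytes out
```

When TeX reads the result, every character it sees has a code between 0
and 255, one per original byte.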

Of course, I want to add support for hyperref. But currently I don't
know what a char token between 128 and 255 means:
* Proper Unicode?
* A byte of other encoding?
* Meaning of catcode (11, 12, 13)?
Yours sincerely
Heiko

participants (4)
- Hans Hagen
- Heiko Oberdiek
- Jonathan Sauer
- Taco Hoekwater