[Dev-luatex] UTF-16 in \pdfoutline

Heiko Oberdiek oberdiek at uni-freiburg.de
Tue Dec 11 09:43:19 CET 2007

On Tue, Dec 11, 2007 at 09:07:31AM +0100, Jonathan Sauer wrote:

> % The following code converts a string to UTF-16 big endian with BOM
> % and outputs it using \message:

> 	function convertToUTF16(str)
> 		local result = string.char(0xFE) .. string.char(0xFF)
> 		for c in string.utfvalues(str) do
> 			if c < 0x10000 then
> 				result = result ..
> 						 string.char(c / 256) ..
> 						 string.char(c % 256)
> 			else

> 		tex.print('\\message{' .. result .. '}')

> This fails with 'Text line contains an invalid utf-8 sequence.' (not
> surprising, since the text is UTF-16 big endian). If I want to pass the
> UTF-16-encoded string e.g. to \pdfoutline (since PDF bookmarks can be
> encoded in UTF-16), how do I do this?

In the package "pdftexcmds" I had a similar problem when reimplementing
\pdfunescapehex. There, too, 8-bit bytes can occur in the result.
My solution was:
* Use of a token register (avoids catcode troubles), principle:
        function convert(str)
          tex.settoks("foobartoks", <result>)
  Alternatively tex.print may be used, but it requires a catcode table
  to avoid trouble with unexpected catcode settings. (Currently
  I am writing a package for this purpose.)
* The convert function in Lua first calculates the byte string.
  Then it replaces each byte in the Lua string by the UTF-8
  sequence for the character with the same code (multi-byte for
  bytes above 127). LuaTeX then sees proper UTF-8 input and
  converts it back.
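
The two steps of the second point could be sketched roughly as follows.
This is only a minimal illustration in plain Lua, not the code from
pdftexcmds. The function names utf16be_bytes and bytes_to_utf8 are made
up for this example, and the input is assumed to be a table of Unicode
code points rather than a UTF-8 string, so that the sketch does not
depend on LuaTeX's string extensions.

```lua
-- Step 1 (assumed helper): build the UTF-16BE byte string, with BOM,
-- from a table of Unicode code points.
function utf16be_bytes(codepoints)
  local out = { string.char(0xFE, 0xFF) }      -- byte order mark
  local function put16(u)                      -- append one 16-bit unit
    out[#out + 1] = string.char(math.floor(u / 256), u % 256)
  end
  for _, c in ipairs(codepoints) do
    if c < 0x10000 then
      put16(c)
    else
      -- code points above the BMP become a surrogate pair
      local v = c - 0x10000
      put16(0xD800 + math.floor(v / 0x400))
      put16(0xDC00 + v % 0x400)
    end
  end
  return table.concat(out)
end

-- Step 2 (assumed helper): replace every byte by the UTF-8 encoding
-- of the code point with the same value (0..255).
function bytes_to_utf8(bytes)
  local out = {}
  for i = 1, #bytes do
    local b = string.byte(bytes, i)
    if b < 0x80 then
      out[#out + 1] = string.char(b)           -- ASCII stays as it is
    else
      -- two-byte sequence: 110xxxxx 10xxxxxx
      out[#out + 1] = string.char(0xC0 + math.floor(b / 64),
                                  0x80 + b % 64)
    end
  end
  return table.concat(out)
end
```

The string returned by bytes_to_utf8(utf16be_bytes(...)) is valid
UTF-8, so it should be acceptable as input to LuaTeX. Whether the
resulting char tokens end up as the intended bytes in the PDF file
is a separate question.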

Of course, I want to add support for hyperref. But currently I don't
know what a char token between 128 and 256 means:
* Proper Unicode?
* A byte of other encoding?
* Meaning of catcode (11, 12, 13)?

Yours sincerely
  Heiko <oberdiek at uni-freiburg.de>
