On Tue Feb 9, 2021 at 7:49 PM CET, Hans Hagen wrote:
On 2/9/2021 6:57 PM, Michal Vlasák wrote:
Hello,
conversion to UTF-16BE PDF strings used for example in bookmarks / PDF outlines is not right.
Take the following example:
```
\starttext
\setupinteraction[state=start]
\placebookmarks[section][number=no]
\section[bookmark=𝕄]
\stoptext
```
This produces <FEFF0075DD44> for 𝕄 (U+1D544), instead of the correct <FEFFD835DD44>.
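As an independent cross-check of the expected value, here is a minimal sketch in Python (not ConTeXt's Lua) that builds the hex string with Python's built-in UTF-16BE codec. The helper name `pdf_hex_string` is made up for this illustration and is not part of ConTeXt.

```python
def pdf_hex_string(text: str) -> str:
    """Return text as a PDF hex string: the BOM FEFF followed by UTF-16BE code units."""
    # str.encode("utf-16-be") emits big-endian code units without a BOM,
    # so the BOM is prepended by hand.
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


print(pdf_hex_string("\U0001D544"))  # → <FEFFD835DD44>
```

Characters below U+10000 come out as a single 16-bit unit, for example `<FEFF0041>` for `A`; only the supplementary planes need a surrogate pair.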
The relevant function is `lpdf.tosixteen()` (from lpdf-ini.lmt) and its `cache`, although the same function also appears in lpdf-aux.lmt and in the MkIV equivalents.
My proposal (also enclosed as a file attachment):
```
--- a/lpdf-ini.lmt
+++ b/lpdf-ini.lmt
@@ -178,7 +178,8 @@
     if v < 0x10000 then
         v = format("%04x",v)
     else
-        v = format("%04x%04x",rshift(v,10),v%1024+0xDC00)
+        v = v - 0x10000
+        v = format("%04x%04x",rshift(v,10)+0xD800,v%1024+0xDC00)
     end
     t[k] = v
     return v
```
(Note the similarity to the existing function `big()` in l-unicode.lua.)
I found this by chance, and I am not really a ConTeXt user, so I hope I didn't miss anything.
Thanks for noticing (btw, the aux file is used in some scripts, not in context itself).
Hans
Unfortunately, the version in the latest LMTX is still not right. The subtraction of 0x10000 is really needed, at least for the high surrogate. (Note how the number is added back in the inverse function `lpdf.fromsixteen()`.) My other suggestion, which does the subtraction only for the high surrogate, is below.

I still prefer my first suggestion, quoted above, which seems clearer: from a number in the range 0x10000 to 0x10FFFF, subtract 0x10000, which gives a 20-bit number in the range 0x0 to 0xFFFFF. The higher 10 bits are encoded into the high surrogate (16 bits) by adding 0xD800 (so its high 6 bits are 110110), and the lower 10 bits are encoded into the low surrogate by adding 0xDC00 (high 6 bits are 110111).

Michal

```
--- a/lpdf-ini.lmt
+++ b/lpdf-ini.lmt
@@ -176,7 +176,7 @@
     if v < 0x10000 then
         v = format("%04x",v)
     else
-        v = format("%04x%04x",rshift(v,10)+0xD800,v%1024+0xDC00)
+        v = format("%04x%04x",rshift(v-0x10000,10)+0xD800,v%1024+0xDC00)
     end
     t[k] = v
     return v
```
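The surrogate arithmetic described in this message can be sketched as follows. This is an illustration in Python rather than the Lua from the patch, and the function names `surrogates`, `from_surrogates` and `without_subtraction` are hypothetical; they only mirror the formulas discussed here, not the actual ConTeXt code.

```python
def surrogates(cp: int) -> tuple:
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    v = cp - 0x10000            # 20-bit number in 0x0..0xFFFFF
    high = 0xD800 + (v >> 10)   # upper 10 bits
    low = 0xDC00 + (v & 0x3FF)  # lower 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Inverse: recombine the two 10-bit halves and add 0x10000 back."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def without_subtraction(cp: int) -> tuple:
    """Same formula, but skipping the subtraction of 0x10000 before the shift."""
    return 0xD800 + (cp >> 10), 0xDC00 + (cp % 1024)


cp = 0x1D544
print("%04X %04X" % surrogates(cp))           # → D835 DD44
print("%04X %04X" % without_subtraction(cp))  # → D875 DD44
```

Decoding `D875 DD44` with `from_surrogates` gives 0x2D544, that is, the original code point shifted up by exactly 0x10000. The low surrogate is unaffected either way because 0x10000 is a multiple of 1024, which is why subtracting only in the high-surrogate term is sufficient.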