Strings in LuaTeX's pdfe
Hi,

I recently tried to do something with the embedded pdfe library and noticed that accessing strings comes with certain problems.

PDF strings are always returned in raw form without the surrounding <> or (), so any script using them needs to know whether it is a hex string or a "normal" () delimited string in order to treat it correctly. So pdfe.getstring is a bit weird: it gives a Lua string but no indication of which type of string is returned. For example, if pdfe.getstring returns "425", it can either correspond to the actual text "425" or be the hexadecimal encoding of "BP". Given that PDF allows both upper- and lowercase letters and even an odd number of digits in a hexadecimal string, even guessing the right format is hard and error-prone, making pdfe.getstring not particularly useful. The same issue appears with the `__index` metamethods of dictionaries and arrays.

This is especially weird because it's inconsistent with PDF names, which always get decoded before they are passed to the user.

Also, even after the Lua script figures out whether it is a hex string or a literal string, it has to decode it. (Of course this part only applies if the actual value is needed and not if it should only be passed into another PDF string.) That's not complicated, but it feels weird: after all, the underlying pplib already decoded the string, so it seems like it would be easier to make this decoded version accessible to the user.

So would it be possible to either change the existing functions or add new ones to

1. return the already decoded value and/or
2. give an indication whether a literal or a hex string is returned?

Best regards,
Marcel
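To make the ambiguity concrete, here is a minimal sketch in plain Lua of what a script has to do once it knows that a raw value came from a hex string. The helper name decode_hex is hypothetical and not part of pdfe. It relies on two rules from the PDF specification: white space inside a hex string is ignored, and an odd number of digits is completed with a trailing 0.

```lua
-- decode_hex is a hypothetical helper, not a pdfe function.
-- It decodes the raw content of a PDF hex string (the text between < and >).
local function decode_hex(raw)
  -- PDF ignores white space inside hex strings.
  local digits = raw:gsub("%s+", "")
  -- With an odd number of digits the last digit is followed by an implied 0.
  if #digits % 2 == 1 then
    digits = digits .. "0"
  end
  -- Turn each pair of hex digits (upper- or lowercase) into one byte.
  -- The outer parentheses drop gsub's second return value (the match count).
  return (digits:gsub("%x%x", function(pair)
    return string.char(tonumber(pair, 16))
  end))
end

print(decode_hex("425"))            -- the example from the message: BP
print(decode_hex("48 65 6c 6C 6f")) -- mixed case and white space: Hello
```

Without knowing the string type, the same raw value "425" would have to be left untouched, which is exactly the information pdfe.getstring does not provide.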
On Fri, 3 Apr 2020 02:08:25 +0200, Marcel Fabian Krüger wrote:
Hi,
I recently tried to do something with the embedded pdfe library and noticed that accessing strings comes with certain problems. PDF strings are always returned in raw form without the surrounding <> or (), so any script using them will need to know if it is a hex string or a "normal" () delimited string in order to treat it correctly. So pdfe.getstring is a bit weird: It gives a Lua string but no indication which type of string is returned.
I just ran into the same problem and used the detail field from getfromdictionary/getfromarray to access the string type. But I agree that it would be nice if getstring returned this directly.

\documentclass{article}
\begin{document}
\directlua{
  doc = pdfe.open(kpse.find_file("example-image.pdf"))
  trailerid = pdfe.getarray(pdfe.gettrailer(doc), "ID")
  t, value, detail = pdfe.getfromarray(trailerid, 1)
  if detail then print("HEXSTRING", value) else print("LITERALSTRING", value) end
  t, value, detail = pdfe.getfromdictionary(pdfe.getinfo(doc), "Creator")
  if detail then print("HEXSTRING", value) else print("LITERALSTRING", value) end
}
blub
\end{document}

--
Ulrike Fischer
http://www.troubleshooting-tex.de/
On 4/3/2020 8:33 AM, Ulrike Fischer wrote:

the problem with extra return values for these basic types (string, number, boolean) is that when they are used in arguments (and such) one then needs to encapsulate them in () to make sure that only the first value (the string value) is used ... in the end these are strings (no matter if they are hex encoded or not)

print("STRING", pdfe.getstring(trailerid,1))

adding an extra return value is no big deal but we can't predict incompatibilities (and we're not supposed to introduce these)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
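The compatibility concern can be illustrated in plain Lua without pdfe. In the sketch below, getstring_with_flag is a hypothetical stand-in for a getter that returns a second value. A call in the last argument position passes on all of its return values, and wrapping the call in parentheses truncates it to the first one.

```lua
-- Hypothetical stand-in for a getter that returns a string plus a hex flag;
-- it is not a pdfe function.
local function getstring_with_flag()
  return "425", true
end

-- As the last argument, the call passes on both return values.
print("STRING", getstring_with_flag())    -- print receives three arguments
-- Parentheses around the call keep only the first return value.
print("STRING", (getstring_with_flag()))  -- print receives two arguments

-- Existing code can break when an extra value appears: tonumber takes the
-- boolean as its base argument and raises an error.
local ok_extra = pcall(tonumber, getstring_with_flag())
local ok_first = pcall(tonumber, (getstring_with_flag()))
print(ok_extra, ok_first)
```

This is the kind of silent change in behaviour that an extra return value on an existing function could introduce in user code.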
On 4/5/2020 12:46 PM, luigi scarso wrote:
On Sun, Apr 5, 2020 at 12:35 PM Hans Hagen wrote:
adding an extra return value is no big deal but we can't predict incompatibilities (and we're not supposed to introduce these)
no modification, in case we will add pdfe.getpdfstring and mark getstring as deprecated.
one can define that as a helper in lua if needed:

function pdfe.getpdfstring(n,m)
    if pdfe.type(n) == "pdfe.array" then
        local t, v, d = pdfe.getfromarray(n,m)
        return v, d
    elseif pdfe.type(n) == "pdfe.dictionary" then
        local t, v, d = pdfe.getfromdictionary(n,m)
        return v, d
    end
end

Hans
On Sun, 5 Apr 2020 13:10:27 +0200, Hans Hagen wrote:
if pdfe.type(n) == "pdfe.array"
pdfe.type is not in the documentation ;-). -- Ulrike Fischer http://www.troubleshooting-tex.de/
On Sun, Apr 05, 2020 at 12:46:28PM +0200, luigi scarso wrote:
On Sun, Apr 5, 2020 at 12:35 PM Hans Hagen wrote:
adding an extra return value is no big deal but we can't predict incompatibilities (and we're not supposed to introduce these)
no modification, in case we will add pdfe.getpdfstring and mark getstring as deprecated.
Thank you very much for adding the extended pdfe.getstring. This makes using the library much nicer :) -- Marcel
On Fri, Apr 3, 2020 at 2:08 AM Marcel Fabian Krüger wrote:
Hi,
I recently tried to do something with the embedded pdfe library and noticed that accessing strings comes with certain problems. PDF strings are always returned in raw form without the surrounding <> or (), so any script using them will need to know if it is a hex string or a "normal" () delimited string in order to treat it correctly.
yes, seen it. I will think about it. -- luigi
participants (4)
- Hans Hagen
- luigi scarso
- Marcel Fabian Krüger
- Ulrike Fischer