Deterministic PDFs (switch to disable addition of timestamps and random ID nonces)
For some applications, especially where the output of pdftex lives in any kind of version-control system, it would be extremely helpful to have a switch that causes pdftex to output deterministic PDF files.

At present, there are three fields in each PDF generated by pdfTeX that change at each run on identical source files:

  $ pdftex '\hbox{}\vfill\eject\end'
  Output written on texput.pdf (1 page, 8182 bytes).
  $ mv texput.pdf texput2.pdf
  $ pdftex '\hbox{}\vfill\eject\end'
  Output written on texput.pdf (1 page, 8182 bytes).
  $ diff -a texput.pdf texput2.pdf
  51,52c51,52
  < /CreationDate (D:20150619105331+01'00')
  < /ModDate (D:20150619105331+01'00')
  ---
  > /CreationDate (D:20150619105323+01'00')
  > /ModDate (D:20150619105323+01'00')
  76c76
  < /ID [<BA26D54E6C14BF8195356C4A7162A40A> <BA26D54E6C14BF8195356C4A7162A40A>]
  ---
  > /ID [<D5B098C606C27E75987095F36CDCFED7> <D5B098C606C27E75987095F36CDCFED7>]
Looking at the PDF 1.7 specification in ISO 32000 at

  http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_200...

I believe that all these non-deterministic elements are optional, and could be suppressed without violating the PDF specification: Table 317 on page 550 says that the CreationDate entry of the document information dictionary is optional, and the ModDate entry is optional unless PieceInfo is present. pdfTeX does not output PieceInfo, right? So ModDate is also optional here. Likewise, the ID element appears to be optional as long as the PDF document is not encrypted.

For DVI mode, we already have "-output-comment ''" to suppress the addition of a timestamp.

Feature request: Would it be possible to add a switch, for example a command-line option like "-deterministic-output", that suppresses the automatic addition of timestamps and random ID byte strings to the generated PDF, such that repeated runs of pdftex produce identical output?

Thanks,

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
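The comparison in the diff above can also be scripted. The following is only a rough sketch in Python, assuming that the info dictionary and the trailer are stored uncompressed (as they are in the diff shown); the function names `volatile_fields` and `differing_fields` are made up for this illustration and are not part of pdfTeX or any PDF library.

```python
import re

# Patterns for the three fields that the diff shows changing between runs.
# Assumption: the entries appear as plain text, not inside compressed streams.
VOLATILE = {
    "CreationDate": re.compile(rb"/CreationDate\s*\(([^)]*)\)"),
    "ModDate": re.compile(rb"/ModDate\s*\(([^)]*)\)"),
    "ID": re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]"),
}


def volatile_fields(pdf_bytes: bytes) -> dict:
    """Return the raw values of the run-dependent fields found in pdf_bytes."""
    found = {}
    for name, pattern in VOLATILE.items():
        match = pattern.search(pdf_bytes)
        if match:
            found[name] = match.groups()
    return found


def differing_fields(a: bytes, b: bytes) -> list:
    """List the names of the volatile fields whose values differ between a and b."""
    fields_a, fields_b = volatile_fields(a), volatile_fields(b)
    names = set(fields_a) | set(fields_b)
    return sorted(name for name in names if fields_a.get(name) != fields_b.get(name))
```

Reading both files with `open(path, "rb").read()` and passing the contents to `differing_fields` should name the entries that a deterministic mode would have to suppress or fix.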
> helpful to have a switch that causes pdftex to output deterministic PDF files.

Agreed. (As I recall, another person made the same request not long ago.) I hope we will do this for the next release.

Thanks.
--karl
On 6/23/2015 6:39 PM, Karl Berry wrote:
> > helpful to have a switch that causes pdftex to output deterministic PDF files.
>
> Agreed. (As I recall, another person made the same request not long ago.) I hope we will do this for the next release. Thanks. --karl
One can influence what goes into info, like:

  % engine=pdftex
  \starttext
  \pdfcompresslevel0
  \pdfobjcompresslevel0
  \pdfinfo{/CreationDate/null}
  \pdfinfo{/ModDate/null}
  \pdfinfo{/ID/null}
  test
  \stoptext

(in ConTeXt lingua)

So, one can also set some constants that way. I see no reason for a patch. Use (...) for a more meaningful value.

In fact, one can always replace values like that afterwards with a script as long as the size of the file is kept (last time i did that was over a decade ago).

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On 25/06/15 15:25, Hans Hagen wrote:
> \pdfinfo{/CreationDate/null}
> \pdfinfo{/ModDate/null}
> \pdfinfo{/ID/null}
Two problems with this:

- While your suggestion does make /CreationDate and /ModDate deterministic, I believe it leads to an output file that violates the ISO 32000 PDF 1.7 standard, which says in section 14.3.3 (p 549):

    "Any entry whose value is not known should be omitted from the dictionary rather than included with an empty string as its value."

  http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_200...

  Your solution leads to

    /CreationDate/null/ModDate/null/ID/null

  in that dictionary.

- The random /ID is *not* an entry in the "document information dictionary" (the "info" entry in the file trailer that \pdfinfo modifies), which is why your attempt to modify it with \pdfinfo did not affect its 32-hex-digit value.

(When testing from a script, make sure that the two invocations of pdftex are at least one second apart, otherwise their outputs will be identical simply due to the 1-second resolution of the timestamps.)

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
Hello,

Correction for the time settings in the information dictionary
--------------------------------------------------------------

On 25.06.2015 16:25, Hans Hagen wrote:
> \pdfinfo{/CreationDate/null}
> \pdfinfo{/ModDate/null}

A typo: "/null" is a name object with the name "null". Very likely the null object was intended. The null object as a value in a dictionary means that the key is not present; pdfTeX scans for "/CreationDate" and "/ModDate" and suppresses its own versions if they are found. (The full list is "/Creator", "/Producer", "/CreationDate", "/ModDate", and "/Trapped".)

Correct version:

  \pdfinfo{/CreationDate null}
  \pdfinfo{/ModDate null}

Alternatively, deterministic date values can be set there in PDF date format, e.g.:

  \pdfinfo{/ModDate (D:20150626031503+02'00')}

> \pdfinfo{/ID/null}

This line was likely intended to remove the "/ID" entry. The /ID key is in the trailer dictionary. But

  \pdftrailer{/ID null}

will not work either, because pdfTeX sets its own /ID values regardless of what the user might have provided in \pdftrailer.

pdfTeX calculates the non-deterministic ID values in "utils.c", function "printID". It uses the MD5 sum of the following data for the ID values:

* the current time, by calling the function "time" (resolution is one second),
* the current working directory, by calling "getcwd", and
* the output file name.

The workaround is to overwrite the two hexadecimal strings of the two ID values with a deterministic hex string of the same length. The ID key can be found at the end of the PDF file, and the dictionary is uncompressed (the trailer without the object compression of PDF 1.5, or the dictionary of type XRef in case of object compression).

Yours sincerely
  Heiko Oberdiek
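The overwrite workaround described above might look roughly like the following Python sketch. It is an illustration under assumptions, not a tested tool: it assumes the /ID array is stored as plain hex strings near the end of the file, and the name `overwrite_id` and the choice of "0" as fill digit are invented here.

```python
import re

# Matches "/ID [<hex> <hex>]" and captures the pieces around the two hex strings.
ID_PATTERN = re.compile(
    rb"(/ID\s*\[\s*<)([0-9A-Fa-f]+)(>\s*<)([0-9A-Fa-f]+)(>\s*\])"
)


def overwrite_id(pdf_bytes: bytes, fill: bytes = b"0") -> bytes:
    """Replace both ID hex strings with a constant string of the same length.

    Keeping the length unchanged leaves every byte offset in the file
    (and therefore the cross-reference table) valid.
    """
    if len(fill) != 1 or fill not in b"0123456789ABCDEFabcdef":
        raise ValueError("fill must be a single hexadecimal digit")
    matches = list(ID_PATTERN.finditer(pdf_bytes))
    if not matches:
        return pdf_bytes  # nothing to do, e.g. no /ID entry present
    m = matches[-1]  # the trailer or XRef dictionary sits at the end of the file
    replacement = (
        m.group(1) + fill * len(m.group(2))
        + m.group(3) + fill * len(m.group(4))
        + m.group(5)
    )
    return pdf_bytes[:m.start()] + replacement + pdf_bytes[m.end():]
```

A caller would read the PDF in binary mode, pass the bytes through `overwrite_id`, and write the result back; since only digits inside the two hex strings change, the file size stays the same.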
On 6/26/2015 3:40 AM, Heiko Oberdiek wrote:
> Hello,
>
> Correction for the time settings in the information dictionary
> --------------------------------------------------------------
>
> On 25.06.2015 16:25, Hans Hagen wrote:
> > \pdfinfo{/CreationDate/null}
> > \pdfinfo{/ModDate/null}
>
> A typo, "/null" is a name object with name "null".
indeed
> Very likely the null object was intended. The null object as value in a dictionary means that the key is not present; pdfTeX scans for "/CreationDate" and "/ModDate" and suppresses its own versions, if they are found. (The full list is "/Creator", "/Producer", "/CreationDate", "/ModDate", and "/Trapped".)
>
> Correct version:
>
>   \pdfinfo{/CreationDate null}
>   \pdfinfo{/ModDate null}
>
> Alternatively deterministic date values can be set there in PDF date format, e.g.:
>
>   \pdfinfo{/ModDate (D:20150626031503+02'00')}
>
> > \pdfinfo{/ID/null}
>
> This line was likely intended to remove the "/ID" entry. The /ID key is in the trailer dictionary. But \pdftrailer{/ID null} will not work either, because pdfTeX sets its own /ID values regardless of what the user might have provided in \pdftrailer.
maybe \pdftrailer should also scan for /ID being set
> pdfTeX calculates the non-deterministic ID values in "utils.c", function "printID". It uses the MD5 sum of the following data for the ID values:
>
> * the current time by calling function "time" (resolution is second),
> * the current working directory by calling "getcwd" and
> * the output file name.
>
> The workaround is to overwrite the two hexadecimal strings of the two ID values with a deterministic hex string of the same length. The ID key can be found at the end of the PDF file, and the dictionary is uncompressed (trailer without object compression of PDF 1.5, or the dictionary of type XRef in case of object compression).
that's indeed what i'd do as it's easy to script

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On 6/26/2015 3:40 AM, Heiko Oberdiek wrote:
> \pdfinfo{/CreationDate null}
> \pdfinfo{/ModDate null}

That still leads to

  /CreationDate null/ModDate null

in the resulting PDF, which is deterministic, but not what the ISO 32000 standard recommends, as the entries are still present.
> The workaround is to overwrite the two hexadecimal strings of the two ID values with a deterministic hex string of the same length. The ID key can be found at the end of the PDF file, and the dictionary is uncompressed (trailer without object compression of PDF 1.5, or the dictionary of type XRef in case of object compression).
I need a pure-pdfTeX solution: version-controlled setups may have users on different operating systems (say Linux + Windows + OS X), some of which have no other scripting language than pdftex installed.

If pdftex could be patched such that e.g.

  \pdfinfo{/CreationDate null}
  \pdfinfo{/ModDate null}
  \pdftrailer{/ID null}

suppress the addition of all these dictionary entries, that would be perfect, and would not require the addition of any new primitives. It would just be an improvement to the behaviour and PDF-compatibility of two existing primitives.

If nulling-entries-away worked generally for all pdfinfo entries, I could then even decide to add e.g.

  \pdfinfo{/Producer null}
  \pdfinfo{/PTEX.Fullbanner null}

such that the pdftex and TeX Live version numbers

  /Producer (pdfTeX-1.40.14)
  /PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1)

would disappear as well. After all, these version numbers are also kind-of timestamps, with just much lower (~annual?) resolution.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
Hello, On 26.06.2015 11:48, Markus Kuhn wrote:
> On 6/26/2015 3:40 AM, Heiko Oberdiek wrote:
> > \pdfinfo{/CreationDate null}
> > \pdfinfo{/ModDate null}
>
> That still leads to
>
>   /CreationDate null/ModDate null
>
> in the resulting PDF which is deterministic, but not what the ISO 32000 standard recommends, as the entries are still present.
According to the PDF specification (PDF32000_2008.pdf), Table 317 "Entries in the document information dictionary", both values are *optional* (except for ModDate, if PieceInfo is present).

A null object as the value for a key means that the entry is *not* present. Section 7.3.9 "Null Object":

| Specifying the null object as the value of a dictionary entry
| shall be equivalent to omitting the entry entirely.

Therefore this is valid PDF.

If you are unhappy with null values, then you can also specify a deterministic date and time (whatever "deterministic" means for a workflow). The format of the date string is described in the PDF specification. Example:

  % Set the deterministic values
  % ----------------------------
  \year=2015
  \month=6
  \day=26
  \time=\numexpr 16*60\relax

  % Format date string
  % ----------------------------------------
  % Format: (D:YYYYMMDDHHmmSSOHH'mm)
  % Only the prefix "D:" and the year "YYYY"
  % are mandatory, see PDF specification 7.9.4 "Dates"
  \newcount\timehour
  \newcount\timemin
  \timehour=\time
  \divide\timehour by 60
  \timemin=-\timehour
  \multiply\timemin by 60
  \advance\timemin by \time
  \edef\pdfdate{%
    (D:%
    \the\year % four digits
    \ifnum\month<10 0\fi\the\month
    \ifnum\day<10 0\fi\the\day
    \ifnum\timehour<10 0\fi\the\timehour
    \ifnum\timemin<10 0\fi\the\timemin
    00)%
  }
  % If time zone is needed, then \pdfcreationdate
  % can be scanned, it contains the time zone as
  % it is available for pdfTeX.

  % Set date entries
  % ----------------
  \pdfinfo{%
    /CreationDate\pdfdate
    /ModDate\pdfdate
  }
  % Result: /CreationDate(D:20150626160000)/ModDate(D:20150626160000)

Yours sincerely
  Heiko Oberdiek
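For anyone who prefers to generate the fixed date string outside TeX (for example to paste it into \pdfinfo by hand), the same arithmetic can be sketched in Python. This is a hypothetical helper, not part of any package; like the TeX example it takes the time as minutes since midnight, sets the seconds to 00 and omits the time zone.

```python
def pdf_date(year: int, month: int, day: int, minutes_since_midnight: int) -> str:
    """Format a PDF date string (D:YYYYMMDDHHmmSS) without a time zone.

    minutes_since_midnight plays the role of TeX's \\time register,
    which has a resolution of one minute, so the seconds are fixed at 00.
    """
    hour, minute = divmod(minutes_since_midnight, 60)
    return f"(D:{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}00)"


print(pdf_date(2015, 6, 26, 16 * 60))  # (D:20150626160000)
```

The printed value is the same string that appears in the "Result" comment of the TeX example.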
On 6/26/2015 3:40 AM, Heiko Oberdiek wrote:
> pdfTeX calculates the non-deterministic ID values in "utils.c", function "printID". It uses the MD5 sum of the following data for the ID values:
>
> * the current time by calling function "time" (resolution is second),
> * the current working directory by calling "getcwd" and
> * the output file name.
The trailer dictionary content is defined on page 43 (Table 15) of

  http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_200...

In addition, Section 14.4 (page 551) suggests an MD5 input string, similar to the list Heiko gave above, to determine the ID value:

  • The current time
  • A string representation of the file’s location, usually a pathname
  • The size of the file in bytes
  • The values of all entries in the file’s document information dictionary

But what exactly should or should not be fed into the ID-generating hash function surely depends on workflow requirements. Some may want the time in there, others not. Some may want the entire source file in there, others not.

How about adding a new primitive that takes as input the string that pdfTeX will feed into MD5 in order to generate the file's identifier? Then the user could override the above default choice, e.g. along the lines of

  \usepackage{currfile}
  \pdftrailerid{\today\currfilepath\input\currfilename}

if I wanted the ID to be calculated based on the date, pathname and content of the source file, for example. I could then make the ID depend on whatever strings TeX has access to. In particular, I could also use

  \pdftrailerid{}

to make it a constant, or

  \pdftrailerid{\jobname}

to make it only depend on the filename, etc.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
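To make the idea of a user-supplied hash input concrete, here is a small Python sketch. It only models the proposal: `trailer_id` and `default_like_seed` are invented names, and the way the default inputs are concatenated is an assumption for illustration, not what pdfTeX's printID actually does.

```python
import hashlib


def trailer_id(seed: str) -> str:
    """Build a trailer /ID entry from the MD5 digest of a caller-chosen string."""
    digest = hashlib.md5(seed.encode("utf-8")).hexdigest().upper()
    return f"/ID [<{digest}> <{digest}>]"


def default_like_seed(timestamp: int, cwd: str, output_name: str) -> str:
    """Hypothetical stand-in for the default inputs: time, directory, file name."""
    return f"{timestamp}{cwd}{output_name}"


# Time-dependent seed: two runs one second apart get different IDs.
first = trailer_id(default_like_seed(1435312800, "/home/user/doc", "texput.pdf"))
second = trailer_id(default_like_seed(1435312801, "/home/user/doc", "texput.pdf"))
print(first != second)  # True

# Constant seed: the ID is identical on every run.
print(trailer_id(""))
# /ID [<D41D8CD98F00B204E9800998ECF8427E> <D41D8CD98F00B204E9800998ECF8427E>]
```

With an empty seed the result is the well-known MD5 digest of the empty string, which corresponds to the "make it a constant" case; seeding with only the job name would correspond to the \jobname case.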
Hello, On 26.06.2015 11:16, Hans Hagen wrote:
> On 6/26/2015 3:40 AM, Heiko Oberdiek wrote:
> > > \pdfinfo{/ID/null}
> >
> > This line was likely intended to remove the "/ID" entry. The /ID key is in the trailer dictionary. But \pdftrailer{/ID null} will not work either, because pdfTeX sets its own /ID values regardless of what the user might have provided in \pdftrailer.
>
> maybe \pdftrailer should also scan for /ID being set
Yes, then the user can set his own values and it would solve the issue in a clean way. Yours sincerely Heiko Oberdiek
participants (4)
- Hans Hagen
- Heiko Oberdiek
- Karl Berry
- Markus Kuhn