Re: [Dev-luatex] Bug#1009196: texlive-binaries: Reproducible content of .fmt files
Hi Luigi, hi all luatex devs,

here at Debian we got a bug report about the reproducibility of luatex format dumps. It contains a patch to make the hyphenation exception list sorted. (I attach the patch.)

Could you please take a look at whether this is still relevant for the latest release of luatex?

Thanks
Norbert

On Fri, 08 Apr 2022, Roland Clobus wrote:
> Hello maintainers of texlive-binaries,
>
> While working on the “reproducible builds” effort [1], I have noticed that the live image for Cinnamon in bookworm is no longer reproducible [2].
>
> The attached patch ensures that the output of the function 'exception_strings' always uses the same order of the hyphenation exceptions. I've written the solution in C, perhaps someone more versed in lua could rewrite it more elegantly. (The lua manual says for the 'next' function: 'The order in which the indices are enumerated is not specified' [3])
>
> With the attached patch applied, I'm able (with the help of FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH) to reproducibly rebuild the .fmt files, as created by 'fmtutil --sys --all'.
>
> Small test case to reproduce:
>
>   export FORCE_SOURCE_DATE=1
>   export SOURCE_DATE_EPOCH=$(date +%s)
>   for i in `seq 1 10`; do
>     luahbtex -ini -jobname=luahbtex -progname=luahbtex luatex.ini > /dev/null
>     md5sum luahbtex.*
>   done
>
> With kind regards,
> Roland Clobus
>
> [1]: https://wiki.debian.org/ReproducibleBuilds
> [2]: https://jenkins.debian.net/view/live/job/reproducible_debian_live_build_cinn...
> [3]: http://www.lua.org/manual/5.4/manual.html#pdf-next
-- PREINING Norbert https://www.preining.info Mercari Inc. + IFMGA Guide + TU Wien + TeX Live GPG: 0x860CDC13 fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13
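The approach of the first patch can be sketched as follows, assuming the exception strings have already been collected into an array. This is an illustration in C, not the attached patch; `sort_exceptions` and `compare_strings` are hypothetical names. It only shows that sorting the collected entries before writing them gives an order that no longer depends on how the Lua table happened to be enumerated.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: order C strings bytewise via strcmp. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort the collected exception strings in place so
   that the dumped order is independent of the table traversal order. */
static void sort_exceptions(const char **words, size_t count)
{
    qsort(words, count, sizeof(words[0]), compare_strings);
}
```

Whatever order the entries were collected in, the array is in the same bytewise order after the call, so the bytes written to the format would be the same on every run.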
On 4/11/2022 6:56 AM, Norbert Preining wrote:
> Hi Luigi, hi all luatex devs,
>
> here at Debian we got a bug report about the reproducibility of luatex format dumps. It contains a patch to make the hyphenation exception list sorted. (I attach the patch)
>
> Could you please take a look whether this is still relevant for the latest release of luatex.

it actually defeats one of the security properties of lua (which was explicitly introduced at some point: make sure that hashes have random order each run so that it's harder to retrieve sensitive data from mem)
that said, it means that as soon as something other than exceptions gets stored in the format, one can face the same issue (although one can work around that by sorting etc)
if you want reproducibility for some testing, mess with this instead:
#if !defined(luai_makeseed)
#include
Hi Hans, hi Roland,

thanks for your answer.

> it actually defeats one of the security properties of lua (which was explicitly introduced at some point: make sure that hashes have random order each run so that it's harder to retrieve sensitive data from mem)

Well, that is a good point to *not* implement the change.

Roland, do you have any comments? I guess the striving for reproducibility is not as important as security.

So if something along these lines is done, it would need to change the sort order if and only if FORCE_SOURCE_DATE=1 is set in the env (this is what is required for tex engines to obey SOURCE_DATE_EPOCH settings).

Roland, if you have time, please adjust the patch to work within the above constraints.

Best regards
Norbert
On Mon, Apr 11, 2022 at 1:01 PM Norbert Preining wrote:
> Hi Hans, hi Roland,
>
> thanks for your answer.
>
>> it actually defeats one of the security properties of lua (which was explicitly introduced at some point: make sure that hashes have random order each run so that it's harder to retrieve sensitive data from mem)
>
> Well, that is a good point to *not* implement the change.
>
> Roland, do you have any comments? I guess the striving for reproducibility is not as important as security.
>
> So if something along these lines is done, it would need to change the sort order if and only if FORCE_SOURCE_DATE=1 is set in the env (this is what is required for tex engines to obey SOURCE_DATE_EPOCH settings).
not only fmt, every output could suffer from the same problem if it depends on a lua table that is not an array -- temp data, log and pdf.

The format should serialize only arrays, or use a metatable (e.g. https://stackoverflow.com/questions/30970034/lua-in-pairs-with-same-order-as... )

Even if we hard code in some way an ordered table data structure, it's still the responsibility of the format to use it -- but then metatables are more flexible.

--
luigi
> not only fmt, every output could suffer from the same problem if it

If the final output (pdf) has traces of that, it might be of concern. But for now the discussion is about the fmt dump, which is independent of these items.

Best regards
Norbert
Hello Hans, Norbert,

Thanks for your answers.

On 11/04/2022 13:01, Norbert Preining wrote:
>> it actually defeats one of the security properties of lua (which was explicitly introduced at some point: make sure that hashes have random order each run so that it's harder to retrieve sensitive data from mem)
>
> Well, that is a good point to *not* implement the change.
>
> Roland, do you have any comments? I guess the striving for reproducibility is not as important as security.
Well, reproducibility is *another* aspect of security; this time not for the regular environments that users will use, but for build environments.

Reproducibility (as enforced by SOURCE_DATE_EPOCH) is typically enabled in an environment that generates binaries from source code for redistribution. It will guarantee that the build environment has not been tampered with, because you can (if you have made a similar build environment yourself) generate the binary files bit-for-bit identical. For a regular, production environment you should not have SOURCE_DATE_EPOCH set.

Other programming languages also have solved the security risks associated with the randomness of the hashes and reproducibility, see [1]. For Perl, the hashes can be de-randomized with PERL_HASH_SEED. Python uses PYTHONHASHSEED. For Lua an environment variable LUA_HASH_SEED could be introduced, or by default the value of SOURCE_DATE_EPOCH (if set) could be used instead of time(NULL) to seed the hashes.

The texlive-binaries in Debian contain an embedded copy of Lua 5.3. The Lua 5.4 version of luai_makeseed is more complex, see [2]. I'll write a feature request for Lua later; that is out of scope for this scenario.
> So if something along these lines is done, it would need to change the sort order if and only if FORCE_SOURCE_DATE=1 is set in the env (this is what is required for tex engines to obey SOURCE_DATE_EPOCH settings).
>
> Roland, if you have time, please adjust the patch to work within the above constraints.

Ack. Thanks for the pointer to luai_makeseed, that was some missing information that I needed. I'll post an updated patch soon (most probably much smaller and more elegant). As written above, the hash seed will be de-randomized only when both FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set.

With kind regards,
Roland Clobus

[1] https://reproducible-builds.org/docs/stable-outputs/
[2] https://sources.debian.org/src/lua5.4/5.4.4-1/src/lstate.c/?hl=73#L73
On 4/11/2022 4:34 PM, Roland Clobus wrote:
> The texlive-binaries in Debian contain an embedded copy of Lua 5.3. The Lua 5.4 version of luai_makeseed is more complex, see [2]. I'll write a feature request for Lua later, that is out-of-scope for this scenario.

fyi: it is unlikely that luatex will move to 5.4 because it might break existing code and/or introduce incompatibilities (so we assume 5.3 for now)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On Mon, Apr 11, 2022 at 5:52 PM Roland Clobus wrote:
> Hello Hans, Norbert,
>
> Thanks for your answers.
>
> On 11/04/2022 13:01, Norbert Preining wrote:
>>> it actually defeats one of the security properties of lua (which was explicitly introduced at some point: make sure that hashes have random order each run so that it's harder to retrieve sensitive data from mem)
>>
>> Well, that is a good point to *not* implement the change.
>>
>> Roland, do you have any comments? I guess the striving for reproducibility is not as important as security.
>
> Well, reproducibility is *another* aspect of security; this time not for the regular environments that users will use, but for build environments.
>
> Reproducibility (as enforced by SOURCE_DATE_EPOCH) is typically enabled in an environment that generates binaries from source code for redistribution. It will guarantee that the build environment has not been tampered with, because you can (if you have made a similar build environment yourself) generate the binary files bit-for-bit identical. For a regular, production environment you should not have SOURCE_DATE_EPOCH set.
>
> Other programming languages also have solved the security risks associated with the randomness of the hashes and reproducibility, see [1]. For Perl, the hashes can be de-randomized with PERL_HASH_SEED. Python uses PYTHONHASHSEED. For Lua an environment variable LUA_HASH_SEED could be introduced, or by default the value of SOURCE_DATE_EPOCH (if set) could be used instead of time(NULL) to seed the hashes.
>
> The texlive-binaries in Debian contain an embedded copy of Lua 5.3. The Lua 5.4 version of luai_makeseed is more complex, see [2]. I'll write a feature request for Lua later, that is out-of-scope for this scenario.
>
>> So if something along these lines is done, it would need to change the sort order if and only if FORCE_SOURCE_DATE=1 is set in the env (this is what is required for tex engines to obey SOURCE_DATE_EPOCH settings).
>>
>> Roland, if you have time, please adjust the patch to work within the above constraints.
>
> Ack. Thanks for the pointer to luai_makeseed, that was some missing information that I needed. I'll post an updated patch soon (most probably much smaller and more elegant). As written above, the hash seed will be de-randomized only when both FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set.
I am perplexed, perhaps I misunderstood something. The distinction between "the regular environments that users will use" and the "build environments" seems to be made at runtime for the same binary by setting an env. variable -- but in this case a malicious "regular" user could also set LUA_HASH_SEED, breaking the security property.

In this *specific* case, one can check by sorting -- as done by the patch:

  #!/bin/sh
  export FORCE_SOURCE_DATE=1
  export SOURCE_DATE_EPOCH=$(date +%s)
  for i in `seq 1 10`; do
    luahbtex -ini -jobname=luahbtex -progname=luahbtex luatex.ini 1>/dev/null
    gunzip -d -c luahbtex.fmt | tail -1 | xxd -i | perl -pe 's{,\s*}{\n}g;s{^\s*}{}g;' | sort | md5sum
    md5sum luahbtex.log
  done

because *in this case* two distinct fmt differ only at the last line -- but perhaps choosing another format (lualatex) could make more sense.

--
luigi
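The check above rests on the observation that two format dumps can contain the same bytes in a different order. A minimal sketch of that idea in C, using a byte histogram in place of the sort and md5sum pipeline; `count_bytes` and `same_bytes_any_order` are hypothetical helpers for illustration and are not part of luatex or of the patch.

```c
#include <stddef.h>
#include <string.h>

/* Count how often each byte value occurs in data[0..len). */
static void count_bytes(const unsigned char *data, size_t len,
                        size_t counts[256])
{
    memset(counts, 0, 256 * sizeof(counts[0]));
    for (size_t i = 0; i < len; i++)
        counts[data[i]]++;
}

/* Return 1 if both buffers hold the same bytes, possibly in a different
   order, and 0 otherwise. */
static int same_bytes_any_order(const unsigned char *a, size_t a_len,
                                const unsigned char *b, size_t b_len)
{
    size_t ca[256], cb[256];

    if (a_len != b_len)
        return 0;
    count_bytes(a, a_len, ca);
    count_bytes(b, b_len, cb);
    return memcmp(ca, cb, sizeof(ca)) == 0;
}
```

Like the shell pipeline, this only tells whether two buffers are permutations of each other; it says nothing about where in the file the order differs.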
Hello luigi and others,

On 11/04/2022 20:28, luigi scarso wrote:
...
> I am perplexed, perhaps I misunderstood something. The distinction between "the regular environments that users will use" and the "build environments" seems to be made at runtime for the same binary by setting an env. variable -- but in this case a malicious "regular" user could also set LUA_HASH_SEED, breaking the security property.
That's why the documentation for such potentially security-breaking features mentions how they are to be used. One is typically not expected to set the seed values, but if you do set them, it's your own responsibility.

E.g. Python's man page:
<quote>
PYTHONHASHSEED
    If this variable is set to "random", a random value is used to seed the hashes of str and bytes objects.

    If PYTHONHASHSEED is set to an integer value, it is used as a fixed seed for generating the hash() of the types covered by the hash randomization. Its purpose is to allow repeatable hashing, such as for selftests for the interpreter itself, or to allow a cluster of python processes to share hash values.

    The integer must be a decimal number in the range [0,4294967295]. Specifying the value 0 will disable hash randomization.
</quote>

Perl has a more severe disclaimer: https://perldoc.perl.org/perlrun#PERL_HASH_SEED
<quote>
PLEASE NOTE: The hash seed is sensitive information. Hashes are randomized to protect against local and remote attacks against Perl code. By manually setting a seed, this protection may be partially or completely lost.
</quote>
> In this *specific* case, one can check by sorting -- as done by the patch:
>
>   #!/bin/sh
>   export FORCE_SOURCE_DATE=1
>   export SOURCE_DATE_EPOCH=$(date +%s)
>   for i in `seq 1 10`; do
>     luahbtex -ini -jobname=luahbtex -progname=luahbtex luatex.ini 1>/dev/null
>     gunzip -d -c luahbtex.fmt | tail -1 | xxd -i | perl -pe 's{,\s*}{\n}g;s{^\s*}{}g;' | sort | md5sum
>     md5sum luahbtex.log
>   done
This checks the whole file, but the issue is that the order of the bytes is different only at a specific location in the file: the list of hyphenation exceptions. Only that specific part needs special handling.

For completeness, this issue is present in at least 3 .fmt files. Each is generated by 'fmtutil --sys --all', which in turn does:

  luahbtex -ini -jobname=luahbtex -progname=luahbtex luatex.ini
  luatex -ini -jobname=dviluatex -progname=dviluatex dviluatex.ini
  luatex -ini -jobname=luatex -progname=luatex luatex.ini

In the case of texlive: setting *both* FORCE_SOURCE_DATE and SOURCE_DATE_EPOCH will be IMHO sufficiently special to allow disabling the random hashing seed.

I'll follow up soon with an updated patch.

With kind regards,
Roland
Hello list,

On 12/04/2022 08:44, Roland Clobus wrote:
> I'll follow up soon with an updated patch.

As discussed, I've updated the patch.

For Lua-based TeX binaries, only when FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set, this will initialise the Lua seed to the value of SOURCE_DATE_EPOCH instead of a random value. With this patch, the .fmt files can be generated bit-for-bit identical.

Regarding the patch:
* This patch is intended only for Lua 5.3 that is embedded in texlive-binaries
* A re-definition of `luai_makeseed` is unfortunately not sufficient for Lua 5.3, for 5.4.4 and later it would be. [1]
* I've added no validation for the content of SOURCE_DATE_EPOCH:
** 1) That happens in other code locations already
** 2) Even if the value would be incorrect, the Lua seed will still be de-randomized
* Do you want some comment lines?
* The sorting from my previous patch is no longer required. Only lstate.c needs to be modified.

With kind regards,
Roland Clobus

PS: If you later intend to upgrade to another version of Lua, the fixed seed value can help you in automated tests to see different behaviour due to the upgrade.

[1] https://github.com/lua/lua/commit/97e394ba1805fbe394a5704de660403901559e54
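A rough sketch in C of the behaviour described in this message. This is not the patch itself: `seed_from_values` and `choose_seed` are hypothetical names, and the fallback to time(NULL) only stands in for whatever randomized seed Lua would otherwise compute. As in the description, the value of SOURCE_DATE_EPOCH is not validated.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: pick a fixed seed when FORCE_SOURCE_DATE is "1"
   and SOURCE_DATE_EPOCH is present, otherwise return the fallback. */
static unsigned int seed_from_values(const char *force, const char *epoch,
                                     unsigned int fallback)
{
    if (force != NULL && strcmp(force, "1") == 0 && epoch != NULL) {
        /* No validation: even a malformed value gives a deterministic,
           de-randomized seed. */
        return (unsigned int) strtoul(epoch, NULL, 10);
    }
    return fallback;
}

/* Hypothetical wrapper that reads the two variables from the environment
   and falls back to a time-based seed (an assumption for this sketch). */
static unsigned int choose_seed(void)
{
    return seed_from_values(getenv("FORCE_SOURCE_DATE"),
                            getenv("SOURCE_DATE_EPOCH"),
                            (unsigned int) time(NULL));
}
```

Keeping the decision in a function that takes the two values as arguments makes it possible to exercise it in a test without touching the process environment.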
On Tue, Apr 19, 2022 at 9:16 AM Roland Clobus wrote:
> Hello list,
>
> On 12/04/2022 08:44, Roland Clobus wrote:
>> I'll follow up soon with an updated patch.
>
> As discussed, I've updated the patch.
>
> For Lua-based TeX binaries, only when FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set, this will initialise the Lua seed to the value of SOURCE_DATE_EPOCH instead of a random value. With this patch, the .fmt files can be generated bit-for-bit identical.
>
> Regarding the patch:
> * This patch is intended only for Lua 5.3 that is embedded in texlive-binaries
> * A re-definition of `luai_makeseed` is unfortunately not sufficient for Lua 5.3, for 5.4.4 and later it would be. [1]
> * I've added no validation for the content of SOURCE_DATE_EPOCH:
> ** 1) That happens in other code locations already
> ** 2) Even if the value would be incorrect, the Lua seed will still be de-randomized
> * Do you want some comment lines?
> * The sorting from my previous patch is no longer required. Only lstate.c needs to be modified.
>
> With kind regards,
> Roland Clobus
>
> PS: If you later intend to upgrade to another version of Lua, the fixed seed value can help you in automated tests to see different behaviour due to the upgrade.
>
> [1] https://github.com/lua/lua/commit/97e394ba1805fbe394a5704de660403901559e54

Thank you very much for your patch, I will check it this weekend.

--
luigi
Hello list,

On 19/04/2022 09:52, luigi scarso wrote:
> Thank you very much for your patch, I will check it this weekend.

Another note: While preparing for a generic change request for Lua, I found a mail by Hans Hagen [1], stating that all cases have been found in luatex.

Sorting the table (as in my original patch) is also a solution, but my proposed patch in lstate.c will fix the root cause. I would rather fix the root cause. If you prefer the sorting patch, I'll adapt it to activate only when FORCE_SOURCE_DATE=1 and SOURCE_DATE_EPOCH are set.

With kind regards,
Roland Clobus

[1] http://lua-users.org/lists/lua-l/2014-07/msg00564.html
Hello luigi, list,

On 19/04/2022 09:52, luigi scarso wrote:
> Thank you very much for your patch, I will check it this weekend.

Have you found the time already to review my patch? [1]

With kind regards,
Roland Clobus

[1] https://mailman.ntg.nl/pipermail/dev-luatex/2022-April/006659.html
On Wed, May 4, 2022 at 3:09 PM Roland Clobus wrote:
> Hello luigi, list,
>
> On 19/04/2022 09:52, luigi scarso wrote:
>> Thank you very much for your patch, I will check it this weekend.
>
> Have you found the time already to review my patch? [1]

Yes, Hans and I are discussing. If possible, I would like to use a --reproducible switch at the command line.

--
luigi
On 04/05/2022 15:16, luigi scarso wrote:
> On Wed, May 4, 2022 at 3:09 PM Roland Clobus wrote:
>> On 19/04/2022 09:52, luigi scarso wrote:
>>> Thank you very much for your patch, I will check it this weekend.
>>
>> Have you found the time already to review my patch? [1]
>
> Yes, Hans and I are discussing. If possible, I would like to use a --reproducible switch at the command line.
Adding a command-line argument is sometimes proposed by the development teams, instead of using SOURCE_DATE_EPOCH. I would rather suggest using SOURCE_DATE_EPOCH, which is already in the code base, instead of adding a new code path.

If you find the time, please read the documentation on SOURCE_DATE_EPOCH [1] and the page that mentions a checklist [2].

The short summary: SOURCE_DATE_EPOCH has been standardized and is primarily intended to be used by rebuilders of the binaries, not the developers or end-users.

In the past, when SOURCE_DATE_EPOCH was getting established, texlive additionally added FORCE_SOURCE_DATE=1. Nowadays, if it can be avoided, I would recommend using only SOURCE_DATE_EPOCH. See [3] for all uses of FORCE_SOURCE_DATE in Debian. As you can see, it is mainly used in several tests to ensure that packages have output that can be compared against a reference.

With kind regards,
Roland Clobus

[1] https://reproducible-builds.org/docs/source-date-epoch/
[2] https://wiki.debian.org/ReproducibleBuilds/StandardEnvironmentVariables#Chec...
[3] https://codesearch.debian.net/search?q=FORCE_SOURCE_DATE&literal=0
participants (4): Hans Hagen, luigi scarso, Norbert Preining, Roland Clobus