Dump sharing - redundant copying/allocations
Hello,
when NO_DUMP_SHARE is not defined and the system is Little endian, then for the possibility of sharing format files between architectures all dumped multi-byte values are byte swapped by the function "swap_items" in "tex/texfileio.c".
The origin of the function is in Web2C's "texmfmp.c" and it seems that it was added to LuaTeX in the version 1.08 when the source was converted from CWeb (or am I reading the history wrong?).
However, there are two differences between Web2C's and LuaTeX's "swap_items":
1) LuaTeX also supports 12 byte values.
and more importantly:
2) LuaTeX essentially does this:
- allocate temporary array
- copy input to temporary array
- [common code with web2c]
- copy temporary array back to input to serve as output
- free temporary array
This all seems redundant and causes many small allocations (~350K allocations for a ~800K format file), because most allocations are of only 4 bytes.
Curiously the gcc -O2 optimizer doesn't catch this even though it is a static function (and changing xmalloc/xfree to the "intrinsic" malloc/free doesn't help it). Maybe the possible unsigned int overflow prevents the optimization? Or am I missing some side effect/purpose of the copying/allocating?
See my proposal below.
(Note that in the LuaTeX repository --disable-dump-share is the default, while it isn't in TeX Live, I think.)
Michal Vlasák
--- a/tex/texfileio.c
+++ b/tex/texfileio.c
@@ -1125,13 +1125,9 @@ static gzFile gz_fmtfile = NULL;
 */
-static void swap_items(char *pp, int nitems, int size)
+static void swap_items(char *p, int nitems, int size)
 {
     char temp;
-    unsigned total = (unsigned) (nitems * size);
-    char *q = xmalloc(total);
-    char *p = q;
-    memcpy(p,pp,total);
     /*tex
         Since `size' does not change, we can write a while loop for each case,
@@ -1201,8 +1197,6 @@ static void swap_items(char *pp, int nitems, int size)
     default:
         FATAL1("Can't swap a %d-byte item for (un)dumping", size);
     }
-    memcpy(pp,q,total);
-    xfree(q);
 }
 #endif
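To illustrate what swapping in place (without the temporary array) amounts to, here is a minimal sketch. It is not the code from texfileio.c: the name swap_items_in_place is hypothetical, and it uses one generic reversal loop for any item size, whereas the real function has a separate unrolled loop per supported size and a fatal error for anything else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: reverse the bytes of each of `nitems`
   items of `size` bytes, directly in the caller's buffer.
   No allocation and no copying to a temporary array is involved. */
static void swap_items_in_place(char *p, int nitems, int size)
{
    assert(p != NULL && nitems >= 0 && size > 0);
    while (nitems-- > 0) {
        char *lo = p;            /* first byte of the current item */
        char *hi = p + size - 1; /* last byte of the current item  */
        while (lo < hi) {
            char temp = *lo;
            *lo++ = *hi;
            *hi-- = temp;
        }
        p += size;               /* advance to the next item */
    }
}
```

Calling it on a buffer holding two 4-byte items reverses each item separately and leaves the item order untouched.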
Hi,
On 20 Jul 2021, at 18:01, Michal Vlasák wrote:
Hello,
when NO_DUMP_SHARE is not defined and the system is Little endian, then for the possibility of sharing format files between architectures all dumped multi-byte values are byte swapped by the function "swap_items" in "tex/texfileio.c".
The origin of the function is in Web2C's "texmfmp.c" and it seems that it was added to LuaTeX in the version 1.08 when the source was converted from CWeb (or am I reading the history wrong?).
That code has been around since 2010 (luatex 0.60-ish). I can’t remember exactly why I did that odd copying to a temp array, but I do remember that that particular function was quite problematic w.r.t. endianness and (cross)compiler issues (read: bugs, as in “internal compiler error”).
Best wishes,
Taco
--
Taco Hoekwater E: taco@bittext.nl genderfluid (all pronouns)
On Tue Jul 20, 2021 at 7:56 PM CEST, Taco Hoekwater wrote:
Hi,
On 20 Jul 2021, at 18:01, Michal Vlasák wrote:
Hello,
when NO_DUMP_SHARE is not defined and the system is Little endian, then for the possibility of sharing format files between architectures all dumped multi-byte values are byte swapped by the function "swap_items" in "tex/texfileio.c".
The origin of the function is in Web2C's "texmfmp.c" and it seems that it was added to LuaTeX in the version 1.08 when the source was converted from CWeb (or am I reading the history wrong?).
That code has been around since 2010 (luatex 0.60-ish). I can’t remember exactly why I did that odd copying to a temp array, but I do remember that that particular function was quite problematic w.r.t. endianness and (cross)compiler issues (read: bugs, as in “internal compiler error”).
Thank you for your valuable insight, Taco. My thinking was skewed because I read the history incorrectly, thinking it was more of a mistake that slipped into a big commit.
Anyway, despite the number of allocations, the performance is probably fine, since nobody has complained yet. I noticed it by chance while reading the code.
Best regards,
Michal Vlasák
On 20 Jul 2021, at 21:51, Michal Vlasák wrote:
On Tue Jul 20, 2021 at 7:56 PM CEST, Taco Hoekwater wrote:
That code has been around since 2010 (luatex 0.60-ish). I can’t remember exactly why I did that odd copying to a temp array, but I do remember that that particular function was quite problematic w.r.t. endianness and (cross)compiler issues (read: bugs, as in “internal compiler error”).
Thank you for your valuable insight Taco. My thinking was skewed because I read the history incorrectly, thinking it was more of a mistake that slipped into a big commit.
It was definitely on purpose at the time. But that doesn’t mean you are wrong: your patch should probably be applied. More than a decade later, the original compiler problems should be fixed by now (one would hope so!).
There are massive gaps in the luatex part of the svn history of tex-live, so I don’t think you read the history wrong either, it is just that big chunks of luatex’s development have not taken place in the texlive repository.
Best wishes,
Taco
--
Taco Hoekwater E: taco@bittext.nl genderfluid (all pronouns)
On 7/20/2021 9:51 PM, Michal Vlasák wrote:
On Tue Jul 20, 2021 at 7:56 PM CEST, Taco Hoekwater wrote:
Hi,
On 20 Jul 2021, at 18:01, Michal Vlasák wrote:
Hello,
when NO_DUMP_SHARE is not defined and the system is Little endian, then for the possibility of sharing format files between architectures all dumped multi-byte values are byte swapped by the function "swap_items" in "tex/texfileio.c".
The origin of the function is in Web2C's "texmfmp.c" and it seems that it was added to LuaTeX in the version 1.08 when the source was converted from CWeb (or am I reading the history wrong?).
That code has been around since 2010 (luatex 0.60-ish). I can’t remember exactly why I did that odd copying to a temp array, but I do remember that that particular function was quite problematic w.r.t. endianness and (cross)compiler issues (read: bugs, as in “internal compiler error”).
Thank you for your valuable insight Taco. My thinking was skewed because I read the history incorrectly, thinking it was more of a mistake that slipped into a big commit.
Anyway, despite the number of allocations, the performance is probably fine, since nobody has complained yet. I noticed it by chance while reading the code.
just a few remarks:
- dump sharing in luatex makes no sense, also because lua byte code can be stored and that is not portable .. for that reason byte swapping was removed at some later point in the project
- byte swapping introduces overhead and was happening for the majority of users (intel), but the code was/is still there
- we store a format version number in the format so that a format file will not be loaded when there is a mismatch (that was added later)
- format gz compression was introduced already early to keep the format small and iirc that needed some of these copying tweaks
- as you mention, the allocation overhead is small, which is definitely true compared to byte swapping and all the decompression calls in between (loading the file database at startup probably takes more time, and definitely in the early days was quite noticeable, more than format loading); the level 3 compression gave the best trade-off
- the mentioned 'internal compiler error' mentioned by Taco rings a bell, it's a reason why often saving/loading an 'int' goes via a variable because compilers would optimize in a way that dumping variables (ints) in more complex data structures gave issues
- there are a few more places where redundant temp vars are used because during some operations memory can grow which can make pointers already set invalid (some of those have been sorted out differently in the meantime)
- talking of performance, one of the interesting things in the beginning of development was that we noticed different (incremental) versions to perform differently; for instance when math was opened up the machinery became real slow, as if we crossed some boundary, (compilation order of specific code modules mattered too); but when i then updated my laptop it was fast again, not so much because of the faster cpu but because the cpu cache was larger; compiler optimization also kind of interfered (at that time ideas, experiments and binaries came and went on a daily basis, we had quite some fun)
- if performance is of concern, we also noticed (later on, when luajit entered the scene) that the settings for lua hashing matter, and that luajit had pretty bad heuristics (tuned for url, we published about that) so we used a different hashing there ...
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On Wed Jul 21, 2021 at 9:09 AM CEST, Hans Hagen wrote:
just a few remarks:
- dump sharing in luatex makes no sense, also because lua byte code can be stored and that is not portable .. for that reason byte swapping was removed at some later point in the project
Funnily enough, I looked at the code exactly because of this. As it turns out, in TeX Live after recent problems with format sharing across 32-bit Windows and 64-bit Linux, there is now an effort to ensure the portability of formats:
https://git.texlive.info/texlive/tree/Master/tlpkg/bin/tl-check-fmtshare
Of course the issue of unportable bytecode came up. As far as I know three LuaTeX formats store it in format files: ConTeXt (wasn't and probably won't be checked by the script and users alike), OpTeX and minim (recent format, not even generated in TeX Live).
I evaluated the possibility of byte swapping in the Lua (un)dumping, by introducing a patch in TeX Live, but I don't think it is worth it:
- I personally wouldn't encourage _more_ format sharing between OS's and architectures in the future.
- The types used by Lua (long long, double, int) may not even be portable anyways.
- The "right" approach for portable dumping doesn't fit the current architecture. (https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html)
- byte swapping introduces overhead and was happening for the majority of users (intel), but the code was/is still there
I agree that, if anything, the more common Little endian would have been a better choice.
- as you mention, the allocation overhead is small, which is definitely true compared to byte swapping and all the decompression calls in between (loading the file database at startup probably takes more time, and definitely in the early days was quite noticeable, more than format loading); the level 3 compression gave the best trade-off
Interestingly I didn't find any really measurable slowdown from the byte swapping, possibly because the entire function was turned into a lot of SIMD instructions. (But to be fair I didn't test with a huge format, and usually compile LuaTeX with the byte swapping disabled anyways.)
- the mentioned 'internal compiler error' mentioned by Taco rings a bell, it's a reason why often saving/loading an 'int' goes via a variable because compilers would optimize in a way that dumping variables (ints) in more complex data structures gave issues
- there are a few more places where redundant temp vars are used because during some operations memory can grow which can make pointers already set invalid (some of those have been sorted out differently in the meantime)
Very interesting, hopefully the situation improved since.
- talking of performance, one of the interesting things in the beginning of development was that we noticed different (incremental) versions to perform differently; for instance when math was opened up the machinery became real slow, as if we crossed some boundary, (compilation order of specific code modules mattered too); but when i then updated my laptop it was fast again, not so much because of the faster cpu but because the cpu cache was larger; compiler optimization also kind of interfered (at that time ideas, experiments and binaries came and went on a daily basis, we had quite some fun)
- if performance is of concern, we also noticed (later on, when luajit entered the scene) that the settings for lua hashing matter, and that luajit had pretty bad heuristics (tuned for url, we published about that) so we used a different hashing there ...
Thank you for your insights and caring about performance!
Michal Vlasák
On 7/21/2021 1:58 PM, Michal Vlasák wrote:
On Wed Jul 21, 2021 at 9:09 AM CEST, Hans Hagen wrote:
just afew remarks:
- dump sharing in luatex makes no sense, also because lua byte code can be stored and that is not portable .. for that reason byte swapping was removed at some later point in the project
Funnily enough, I looked at the code exactly because of this. As it turns out, in TeX Live after recent problems with format sharing across 32-bit Windows and 64-bit Linux, there is now an effort to ensure the portability of formats:
https://git.texlive.info/texlive/tree/Master/tlpkg/bin/tl-check-fmtshare
Of course the issue of unportable bytecode came up. As far as I know three LuaTeX formats store it in format files: ConTeXt (wasn't and probably won't be checked by the script and users alike), OpTeX and minim (recent format, not even generated in TeX Live).
context always managed its own format generation also because we operate on an engine axis too and one never calls context by its format stub
I evaluated the possibility of byte swapping in the Lua (un)dumping, by introducing a patch in TeX Live, but I don't think it is worth it:
- I personally wouldn't encourage _more_ format sharing between OS's and architectures in the future.
- The types used by Lua (long long, double, int) may not even be portable anyways.
- The "right" approach for portable dumping doesn't fit the current architecture. (https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html)
even if one would handle bytes in the lua bytecode, the bytecode itself is not portable (and i'm not even sure how luajit stuff fits in because luajit is even more platform specific)
- byte swapping introduces overhead and was happening for the majority of users (intel), but the code was/is still there
I agree that, if anything, the more common Little endian would have been a better choice.
the sharing made sense in the time when one ran tex from a dvd in which case all binaries shared the same precooked formats (or on network shares serving multiple architectures) but afaik running from dvd was dropped and hardly anyone runs multiple platforms from one share (those who do can probably figure out some trick)
- as you mention, the allocation overhead is small, which is definitely true compared to byte swapping and all the decompression calls in between (loading the file database at startup probably takes more time, and definitely in the early days was quite noticeable, more than format loading); the level 3 compression gave the best trade-off
Interestingly I didn't find any really measurable slowdown from the byte swapping, possibly because the entire function was turned into a lot of SIMD instructions. (But to be fair I didn't test with a huge format, and usually compile LuaTeX with the byte swapping disabled anyways.)
it is (or at least was) slower with the native microsoft compiler (win 32 bins from akira) because that compiler is less aggressive in some optimizations (we could deduce it plays safe in some areas of memory casting especially combined with the gz decompression)
- the mentioned 'internal compiler error' mentioned by Taco rings a bell, it's a reason why often saving/loading an 'int' goes via a variable because compilers would optimize in a way that dumping variables (ints) in more complex data structures gave issues
- there are a few more places where redundant temp vars are used because during some operations memory can grow which can make pointers already set invalid (some of those have been sorted out differently in the meantime)
Very interesting, hopefully the situation improved since.
- talking of performance, one of the interesting things in the beginning of development was that we noticed different (incremental) versions to perform differently; for instance when math was opened up the machinery became real slow, as if we crossed some boundary, (compilation order of specific code modules mattered too); but when i then updated my laptop it was fast again, not so much because of the faster cpu but because the cpu cache was larger; compiler optimization also kind of interfered (at that time ideas, experiments and binaries came and went on a daily basis, we had quite some fun)
- if performance is of concern, we also noticed (later on, when luajit entered the scene) that the settings for lua hashing matter, and that luajit had pretty bad heuristics (tuned for url, we published about that) so we used a different hashing there ...
Thank you for your insights and caring about performance!
i think there are still a few places
one observation is that using macros instead of functions for performance makes little sense in a program like tex where one jumps over memory space all the time (compilers are quite okay in optimizing), but there can be differences between versions of e.g. gcc
in general, loss of performance in a tex engine is more due to the way macros are composed (or user styles for that matter)
another one is the performance of the console, i.e. kind of font, buffer, refresh delays defaults (i noticed that linux has large delays so that's the fastest, the new windows terminal is also fast) .. now that one is really measureable .. just try to run with piping the log to a file (all understandable) .. squeezing microseconds out of the binary can easily be nilled that way
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On Wed Jul 21, 2021 at 3:57 PM CEST, Hans Hagen wrote:
even if one would handle bytes in the lua bytecode, the bytecode itself is not portable (and i'm not even sure how luajit stuff fits in because luajit is even more platform specific)
LuaJIT is actually really nice in this regard: "The generated bytecode is portable and can be loaded on any architecture that LuaJIT supports, independent of word size or endianess. However the bytecode compatibility versions must match."
one observation is that using macros instead of functions for performance makes little sense in a program like tex where one jumps over memory space all the time (compilers are quite okay in optimizing), but there can be differences between versions of e.g. gcc
I think that modern compilers are good with inlining, and one can get more out of them especially when functions are marked static. So I incline towards functions rather than macros.
in general, loss of performance in a tex engine is more due to the way macros are composed (or user styles for that matter)
another one is the performance of the console, i.e. kind of font, buffer, refresh delays defaults (i noticed that linux has large delays so that's the fastest, the new windows terminal is also fast) .. now that one is really measureable .. just try to run with piping the log to a file (all understandable) .. squeezing microseconds out of the binary can easily be nilled that way
Yeah, you are right, even for 18 lines of console output I measure a more noticeable difference than with the mallocs and byte swapping.
Thanks,
Michal
On 7/21/2021 4:08 PM, Michal Vlasák wrote:
On Wed Jul 21, 2021 at 3:57 PM CEST, Hans Hagen wrote:
even if one would handle bytes in the lua bytecode, the bytecode itself is not portable (and i'm not even sure how luajit stuff fits in because luajit is even more platform specific)
LuaJIT is actually really nice in this regard:
"The generated bytecode is portable and can be loaded on any architecture that LuaJIT supports, independent of word size or endianess. However the bytecode compatibility versions must match."
Ok, btw, these bytecode compatibility versions are not guaranteed the same within intermediate updates (so for instance during the 5.4 dev stage they changed .. i actually took care of that but didn't want to patch the official code - with a sub number - any more so i dropped that)
one observation is that using macros instead of functions for performance makes little sense in a program like tex where one jumps over memory space all the time (compilers are quite okay in optimizing), but there can be differences between versions of e.g. gcc
I think that modern compilers are good with inlining, and one can get more out of them especially when functions are marked static. So I incline towards functions rather than macros.
also, local optimization is better
in general, loss of performance in a tex engine is more due to the way macros are composed (or user styles for that matter)
another one is the performance of the console, i.e. kind of font, buffer, refresh delays defaults (i noticed that linux has large delays so that's the fastest, the new windows terminal is also fast) .. now that one is really measureable .. just try to run with piping the log to a file (all understandable) .. squeezing microseconds out of the binary can easily be nilled that way
Yeah, you are right, even for 18 lines of console output I measure a more noticeable difference than with the mallocs and byte swapping.
it has to do with the fact that tex outputs on a char by char basis with different criteria for log and console, and most consoles accumulate some before flushing, sometimes up to 200 ms
the old windows console output per-char so that one was hurt most, but there were plenty of ways around that; on osx fancy font features in a console could also work out bad
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On 2021-07-21 at 15:57:00 +0200, Hans Hagen wrote:
the sharing made sense in the time when one ran tex from a dvd in which case all binaries shared the same precooked formats (or on network shares serving multiple architectures) but afaik running from dvd was dropped and hardly anyone runs multiple platforms from one share (those who do can probably figure out some trick)
It still makes sense. It's true that we currently can't provide a live system on a DVD but it's still possible to install a live system on USB sticks. User groups can't provide USB sticks because they are too expensive.
Installing TL on a server is certainly much more common than you think. I have been doing this for years at work. And some of my workflows heavily depend on TeX Live's portability.
So whenever possible, please don't give up this great feature.
Regards,
Reinhard
--
------------------------------------------------------------------
Reinhard Kotucha Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover mailto:reinhard.kotucha@web.de
------------------------------------------------------------------
Hello,
IMHO, the TeX engine can save a flag about endianness to the format. When a TeX engine reads such a format then it can check the current endianness against the saved one and do swapping only if they are different. But this idea wasn't implemented: all formats use non-Intel endianness (by decision of the developers), so swapping is performed very often.
Moreover, saving lua bytecode to the format does not support different architectures (it was mentioned in this thread too). IMHO, the classical message "I am stymied" should be sufficient when a TeX engine reads a format generated on a different architecture.
Petr Olsak
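The endianness-flag idea from Petr's message could be sketched roughly as follows. This is only an illustration of the proposal, which the thread says was never implemented: both function names are hypothetical, and the sketch assumes a single flag stored in the format header that is 1 for a format dumped on a little-endian host and 0 otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return 1 on a little-endian host, 0 on a big-endian one, by looking
   at the first byte in memory of a 16-bit integer holding 1. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical check at undump time: swapping is needed only when the
   byte order recorded in the format differs from the host's. */
static int needs_swap(int stored_little_endian)
{
    assert(stored_little_endian == 0 || stored_little_endian == 1);
    return stored_little_endian != host_is_little_endian();
}
```

With such a flag, a format loaded on the same kind of machine that dumped it would never be swapped, which is the common case.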
On 7/21/2021 3:13 PM, Petr Olsak wrote:
Hello,
IMHO, the TeX engine can save a flag about endianness to the format. When a TeX engine reads such a format then it can check the current endianness against the saved one and do swapping only if they are different. But this idea wasn't implemented: all formats use non-Intel endianness (by decision of the developers), so swapping is performed very often.
Moreover, saving lua bytecode to the format does not support different architectures (it was mentioned in this thread too). IMHO, the classical message "I am stymied" should be sufficient when a TeX engine reads a format generated on a different architecture.
luatex already has additional checking and would not load a format ... a saved version number for instance would come out different in a non matching endian so as far as we know we're okay
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On Wed Jul 21, 2021 at 3:13 PM CEST, Petr Olsak wrote:
IMHO, the TeX engine can save a flag about endianness to the format. When a TeX engine reads such a format then it can check the current endianness against the saved one and do swapping only if they are different. But this idea wasn't implemented: all formats use non-Intel endianness (by decision of the developers), so swapping is performed very often.
Hans was quicker with his reply, but here is how I would put it: The byte swapping on Little Endian isn't mandatory. One can compile with -DNO_DUMP_SHARE.
Moreover, saving lua bytecode to the format does not support different architectures (It was mentioned in this thread too). IMHO, the classical message "I am stymied" should be sufficient when a TeX engine reads a format generated at different architecture.
This is exactly what I get when loading a Big Endian format on Little Endian:
(Fatal format file error; I'm stymied)
The BE/LE check is actually done as a side effect of checking the magic number (0x57325458, the first thing that gets checked).
Michal Vlasák
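Why the magic number doubles as a byte-order check can be shown with a small sketch. Only the constant 0x57325458 is taken from the message; the helper names are hypothetical, and the sketch assumes the first four bytes of the format are read as a raw host-order 32-bit integer without any swapping (as when compiled with -DNO_DUMP_SHARE).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FORMAT_MAGIC 0x57325458u /* value quoted in the message */

/* Interpret four raw bytes as a host-order 32-bit integer, the way a
   plain read into an integer variable would. */
static uint32_t load_native_u32(const unsigned char *bytes)
{
    uint32_t v;
    assert(bytes != NULL);
    memcpy(&v, bytes, sizeof v);
    return v;
}

/* Hypothetical header check: 1 if the magic matches, 0 otherwise.
   A header written with the opposite byte order reads back as
   0x58543257 and is rejected, so no separate endianness test is needed. */
static int magic_ok(const unsigned char *header)
{
    return load_native_u32(header) == FORMAT_MAGIC;
}
```

Because the four bytes of the magic number are not symmetric, reversing them can never produce the expected value.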
participants (5): Hans Hagen, Michal Vlasák, Petr Olsak, Reinhard Kotucha, Taco Hoekwater