Best way to create a large number of documents from a database
Hi,

I have been asked to create a few thousand PDF documents from a CSV "database" today (which I can easily transform into any other form, like XML or a lua table or TeX definitions or whatever).

Generating a few thousand pages would be straightforward, but I'm sure there are some clever ways to handle this scenario as well, I'm just not aware of them :)

One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like

    context --some-params --N=42 --output=document-0042.pdf template.tex

or something along those lines.

What's the best approach with the existing functionality? I would be more than grateful for any hints.

Thank you very much,
Mojca
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:

Hi,
I have been asked to create a few thousand PDF documents from a CSV "database" today (which I can easily transform into any other form, like XML or a lua table or TeX definitions or whatever).
Generating a few thousand pages would be straightforward, but I'm sure there are some clever ways to handle this scenario as well, I'm just not aware of them :)
In CPU cycles, the fastest way is to do a single context --once run generating all the pages as a single document, then use mutool merge to split it into separate documents with a (shell) loop. Starting up mutool is much faster than starting context, even with lmtx.
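For the splitting step, a minimal sketch in Python (assuming a combined all.pdf with one document per page, mutool on the PATH, and an invented doc-NNNN naming scheme) could look like this:

    # Split all.pdf into one single-page PDF per page with mutool merge.
    # Assumes page n of all.pdf corresponds to document n.
    import subprocess

    TOTAL_PAGES = 5000  # adjust to the real page count

    for n in range(1, TOTAL_PAGES + 1):
        # 'mutool merge -o OUT INPUT PAGES' copies the given page range
        # from INPUT into a new file OUT.
        subprocess.run(["mutool", "merge", "-o", "doc-%04d.pdf" % n,
                        "all.pdf", str(n)], check=True)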
One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like context --some-params --N=42 --output=document-0042.pdf template.tex or something along those lines.
If you want to go this route (and you may have to, if each record does not fit exactly on a single page), browse back a day or so in the mailing list archive for Gerben's question about "Using command line values in a TeX document; writing a script?". The replies offer various options using either Lua or TeX code to get at user-supplied arguments from the command line.

Best wishes,
Taco
On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
I have been asked to create a few thousand PDF documents from a CSV "database" today
In CPU cycles, the fastest way is to do a single context --once run generating all the pages as a single document, then use mutool merge to split it into separate documents with a (shell) loop.
Just to make it clear: I don't really need to optimize on the CPU end, as the bottleneck is on the other side of the keyboard, so as long as the CPU can process 5k pages today, I'm fine with it :) :) :)
One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like context --some-params --N=42 --output=document-0042.pdf template.tex or something along those lines.
If you want to go this route (and you may have to, if each record does not fit exactly on a single page),
I do have one page per document. The more annoying part is the strange document names, which need some extra attention when mapping page number -> name (I'm not saying this is not doable).
browse back a day or so in the mailing list archive for Gerben’s question about
"Using command line values in a TeX document; writing a script?"
Thanks a lot for the pointer. I haven't had much time to read through all the emails recently; I only noticed that he was very actively working on some MetaPost stuff, so I wasn't paying attention to this.
The replies offer various options using either Lua or TeX code to get at user-supplied arguments from the command line.
Let me see what I come up with; I'm still fiddling with the data & layout at the moment :)

Mojca
On 4/16/2020 4:38 PM, Mojca Miklavec wrote:
On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
I have been asked to create a few thousand PDF documents from a CSV "database" today
In CPU cycles, the fastest way is to do a single context --once run generating all the pages as a single document, then use mutool merge to split it into separate documents with a (shell) loop.
Just to make it clear: I don't really need to optimize on the CPU end, as the bottleneck is on the other side of the keyboard, so as long as the CPU can process 5k pages today, I'm fine with it :) :) :)
5K is nothing ... so that will work
One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like context --some-params --N=42 --output=document-0042.pdf template.tex or something along those lines.
If you want to go this route (and you may have to, if each record does not fit exactly on a single page),
I do have one page per document. The more annoying part is the strange document names, which need some extra attention when mapping page number -> name (I'm not saying this is not doable).
so, don't make files:

- write a tex file foo.tex
- process it: context --batch --result=1 --once foo

etc ... so, use --result for the target name and use the same input name

(I won't bother you with the template system in context that no one knows of.)

Hans
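A minimal driver for that approach might look like the following sketch (the CSV file, its column name, and the body written into foo.tex are all invented here):

    # Rewrite foo.tex for every record, then compile it; --result names
    # the output PDF, so the input file name never has to change.
    import csv
    import subprocess

    with open("data.csv", newline="") as f:  # hypothetical data file
        for i, row in enumerate(csv.DictReader(f), start=1):
            with open("foo.tex", "w") as tex:
                tex.write("\\starttext\n%s\n\\stoptext\n" % row["name"])
            subprocess.run(["context", "--batch", "--once",
                            "--result=doc-%04d" % i, "foo"], check=True)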
A relatively simple way is to use a templating system such as jinja2 and iterate over a mkiv template. Then call context with subprocess and you have the result; see the sketch below.
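A sketch of that workflow (the file names and the template variable are invented; jinja2's default {{ ... }} delimiters usually coexist with TeX, but they can be reconfigured if they clash):

    # Render one ConTeXt source per CSV record with jinja2, then compile it.
    import csv
    import subprocess
    from jinja2 import Template

    # e.g. template.tex.j2 contains: \starttext Hello {{ name }} \stoptext
    with open("template.tex.j2") as f:
        template = Template(f.read())

    with open("data.csv", newline="") as f:  # hypothetical data file
        for i, row in enumerate(csv.DictReader(f), start=1):
            src = "doc-%04d.tex" % i
            with open(src, "w") as out:
                out.write(template.render(**row))
            subprocess.run(["context", "--batch", "--once", src], check=True)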
On Thu, 16 Apr 2020 at 15:52, Hans Hagen wrote:
On 4/16/2020 4:38 PM, Mojca Miklavec wrote:
On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
I have been asked to create a few thousand PDF documents from a CSV "database" today
In CPU cycles, the fastest way is to do a single context --once run generating all the pages as a single document, then use mutool merge to split it into separate documents with a (shell) loop.
Just to make it clear: I don't really need to optimize on the CPU end, as the bottleneck is on the other side of the keyboard, so as long as the CPU can process 5k pages today, I'm fine with it :) :) :)
5K is nothing ... so that will work
One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like context --some-params --N=42 --output=document-0042.pdf template.tex or something along those lines.
If you want to go this route (and you may have to, if each record does not fit exactly on a single page),
I do have one page per document. The more annoying part is the strange document names, which need some extra attention when mapping page number -> name (I'm not saying this is not doable).
so, don't make files:
- write a tex file foo.tex
- process it: context --batch --result=1 --once foo
etc ... so, use --result for the target name and use the same input name
(I won't bother you with the template system in context that no one knows of.)
Hans
On 16 Apr 2020 at 16:52, Hans Hagen wrote:

(I won't bother you with the template system in context that no one knows of.)
If you throw such bones, I get hungry – where's the flesh? (Where is this in the sources? Is there any documentation?)

I often need ConTeXt templates and mostly use simple replacements (like TITLE, CONTENT), but I'm used to Django templates (earlier Smarty/PHP and Freemarker/Java) and recently used Jinja2 with LaTeX.

Best, Hraban
Henning Hraban Ramm wrote on 16.04.2020 at 19:46:
On 16 Apr 2020 at 16:52, Hans Hagen wrote:

(I won't bother you with the template system in context that no one knows of.)
If you throw such bones, I get hungry – where’s the flesh? (Where is this in the sources? Is there any documentation?)
Look in the manual folder: templates-mkiv.pdf

Wolfgang
On 16 Apr 2020 at 19:57, Wolfgang Schuster wrote:

Henning Hraban Ramm wrote on 16.04.2020 at 19:46:
On 16 Apr 2020 at 16:52, Hans Hagen wrote:

(I won't bother you with the template system in context that no one knows of.)

If you throw such bones, I get hungry – where's the flesh? (Where is this in the sources? Is there any documentation?)
Look in the manual folder: templates-mkiv.pdf
Ah, the LMX templates. Of course I already had this in my bibliography but never took a deeper look, since it seemed too Lua-centric to me. Thank you!

Best, Hraban
On Thu, 16 Apr 2020 at 16:52, Hans Hagen wrote:
On 4/16/2020 4:38 PM, Mojca Miklavec wrote:
On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
One option is that I quickly draft a python script that creates a few thousand TeX documents and compiles them individually, but it might be easier if there was a way to just create a single template document and then run something like context --some-params --N=42 --output=document-0042.pdf template.tex or something along those lines.
If you want to go this route (and you may have to, if each record does not fit exactly on a single page),
I do have one page per document. The more annoying part is the strange document names, which need some extra attention when mapping page number -> name (I'm not saying this is not doable).
so, don't make files:
- write a tex file foo.tex
- process it: context --batch --result=1 --once foo
etc ... so, use --result for the target name and use the same input name
This works just perfectly, thank you very much. I now have template.tex and process it with

    context --batch --result=doc-0042 --someparam=21a --once template

which generates precisely the desired doc-0042.pdf.

For the moment I'm simply using a combination of

    \doifdocumentargument {someparam} {\getdocumentargument{someparam}}

from TeX and environment.arguments from within the Lua code, as suggested by Taco and you in the previous email thread.

Where would be the best place to document this, and under what wiki topic? I'm sure I'll need it again and will forget until then unless I write it down immediately. "Mail merge"? ;)

Thank you very much,
Mojca
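The calling side of this setup is easy to script; a sketch (the CSV file and its someparam column are invented here, the flags are the ones shown above):

    # Compile the same template.tex once per record, passing the record's
    # value as a document argument and naming each output via --result.
    import csv
    import subprocess

    with open("data.csv", newline="") as f:  # hypothetical data file
        for i, row in enumerate(csv.DictReader(f), start=1):
            subprocess.run(["context", "--batch", "--once",
                            "--result=doc-%04d" % i,
                            "--someparam=%s" % row["someparam"],
                            "template"], check=True)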
On 4/16/20 8:32 PM, Mojca Miklavec wrote:
[...] Where would be the best way to document this / under what wiki topic, as I'm sure I'll need it again and forget until then unless I write it down immediately? "Mail merge"? ;)
Hi Mojca,

"Document merge" could also be fine.

Pablo
--
http://www.ousia.tk
On 4/16/2020 8:32 PM, Mojca Miklavec wrote:
Where would be the best way to document this / under what wiki topic, as I'm sure I'll need it again and forget until then unless I write it down immediately? "Mail merge"? ;)

Maybe a 'workflows' entry?
Hans
On Thu, 16 Apr 2020 at 16:38, Mojca Miklavec wrote:
On Thu, 16 Apr 2020 at 11:29, Taco Hoekwater wrote:
On 16 Apr 2020, at 11:12, Mojca Miklavec wrote:
I have been asked to create a few thousand PDF documents from a CSV "database" today
In CPU cycles, the fastest way is to do a single context --once run generating all the pages as a single document, then use mutool merge to split it into separate documents with a (shell) loop.
Just to make it clear: I don't really need to optimize on the CPU end,
... says the optimist ... :) :) :)
as the bottleneck is on the other side of the keyboard, so as long as the CPU can process 5k pages today, I'm fine with it :) :) :)
While the bottleneck was in fact on the other side of the keyboard (preparation certainly took longer than the execution), it still took ca. 2.5 hours to generate the full batch. (I'm pretty sure I could have further optimised the code; even though 1 second per run is still pretty fast [when I started using context it was more like 30 seconds per run], it just adds up when talking about thousands of pages. This greatly reminds me of the awesome speedup that Hans achieved when rewriting the mplib code, and of the initial \sometxt changes inside MetaPost, which also led to 100-fold speedups as one no longer needed to start TeX a zillion times.)

While waiting I wanted to be clever and do the processing in the same folder in parallel (I have lots of cores, after all), and ended up calling a script with

    context --N={n} --output=doc-{nnnn}.pdf template.tex
    context --purge

only to notice much later that running multiple context runs in the same folder (some of them compiling and some of them deleting the temporary files) might not have been the best idea on the planet; many documents ended up missing, and many corrupted. So I had to rerun half of the documents.

One interesting statistic: I used a bunch of images (the same png images in all documents; ca. 290k in total). The generated documents were 1.5 GB in size. When compressed with tar.gz, there was almost no noticeable difference between the compressed and uncompressed data size (1.4 GB vs. 1.5 GB). But tar.xz compressed the 1.5 GB worth of documents into merely 27 MB (a single document is 360k).

The documents have been e-mailed out, but now they need to print hard copies for the archive. I'm happy I don't need to be the one printing and storing those :) :) :)

Mojca
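For what it's worth, those clashes could likely be avoided by giving every run its own scratch directory; a sketch (the template path, naming scheme, parameter, and worker count here are all invented):

    # Give each context run a private working directory so parallel runs
    # cannot compile over, or purge, each other's temporary files.
    import os
    import shutil
    import subprocess
    import tempfile
    from concurrent.futures import ThreadPoolExecutor

    TEMPLATE = os.path.abspath("template.tex")  # hypothetical template
    OUTDIR = os.getcwd()

    def build(n):
        with tempfile.TemporaryDirectory() as tmp:
            name = "doc-%04d" % n
            subprocess.run(["context", "--batch", "--once",
                            "--result=" + name, "--N=%d" % n, TEMPLATE],
                           cwd=tmp, check=True)
            # move the finished PDF out before the scratch dir is deleted
            shutil.move(os.path.join(tmp, name + ".pdf"),
                        os.path.join(OUTDIR, name + ".pdf"))

    with ThreadPoolExecutor(max_workers=8) as pool:  # arbitrary worker count
        list(pool.map(build, range(1, 5001)))  # consume to surface errors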
On 4/17/2020 4:37 PM, Mojca Miklavec wrote:
One interesting statistic: I used a bunch of images (the same png images in all documents; ca. 290k in total).
It can actually make a difference what kind of png image you use. Some png images demand a conversion (or a split of the map etc.) to the format supported by pdf. Often converting the png to pdf and including those is faster.

Hans
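One way to do that conversion up front is the img2pdf Python package (a sketch; the file names are invented):

    # Convert a PNG into a one-page PDF once, up front, so the image does
    # not have to be converted again on every ConTeXt run.
    import img2pdf

    with open("logo.pdf", "wb") as f:  # hypothetical file names
        f.write(img2pdf.convert("logo.png"))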
On Fri, 17 Apr 2020 at 21:11, Hans Hagen wrote:
On 4/17/2020 4:37 PM, Mojca Miklavec wrote:
One interesting statistic: I used a bunch of images (the same png images in all documents; ca. 290k in total).
It can actually make a difference what kind of png image you use. Some png images demand a conversion (or a split of the map etc.) to the format supported by pdf. Often converting the png to pdf and including those is faster.
Thanks for the hint. But I tested it, and it hardly makes any difference.

I had to make another batch for the archive (creating a single document with 4k+ pages), and the full process ran in 10 minutes (compared to ca. 2.5 hours to create the individual documents). Just as a test I completely **removed** all the images, and that only accounted for some 10 or 20 seconds of speedup. So the biggest overhead still seems to be in warming up the machinery (which includes my share of overhead for reading in the 1.3 MB lua table with all the data entries), and Taco's hint of using an external tool for splitting would probably have scored best :)

I need to add that I'm extremely happy about the resource reuse (mostly images). As I already mentioned, the individual documents were 1.5 GB in total, and badly written software would have created an equally bad cumulative PDF, while ConTeXt generates a mere 17 MB file with 4k+ pages. It's really impressive.

Mojca
participants (7)

- Hans Hagen
- Henning Hraban Ramm
- kaddour kardio
- Mojca Miklavec
- Pablo Rodriguez
- Taco Hoekwater
- Wolfgang Schuster