Parallelizing typesetting of large documents with lots of cross-references
Hello,

This email is largely a simple notification of one "Fool's" dream... ("Only Fools rush in where Angels fear to tread").

I am currently attempting to create "a" (crude) "tool" with which I can typeset:

- very large (1,000+ pages),
- highly cross-referenced documents,
- with embedded literate-programmed code (which needs concurrent compiling and execution),
- containing multiple MetaFun graphics,

all based upon ConTeXt-LMTX.

"In theory", it should be possible to typeset individual "sub-documents" (any section which is known to start on a page boundary rather than inside a page), and then re-combine the individual PDFs back into one single PDF for the whole document (complete with control over the page numbering).

The inherent problem is that the *whole* of a ConTeXt document depends upon cross-references from *everywhere* else in the document. TeX and ConTeXt "solve" this problem by using a multi-pass approach (for example, five passes for the `luametatex` document). Between passes, ConTeXt saves this multi-pass data (page numbers and cross-references) in the `*.tuc` file.

Clearly, any parallelization approach needs a process which coordinates the update and re-distribution of any changes in this multi-pass data obtained by typesetting each "sub-document".

My current approach is to have a federation of Docker/Podman "pods". Each "pod" would have a number of ConTeXt workers, as well as (somewhere in the federation) a Lua-based Multi-Pass-Data coordinator. All work would be coordinated by messages sent and received over a corresponding federation of [NATS servers](https://nats.io/). (Neither [Podman](https://podman.io/) pods nor NATS message coordination are problems at the moment.)

--------------------------------------------------------------------

**The real problem**, for typesetting a ConTeXt document, is the design of the critical process which will act as the "Multi-Pass-Data coordinator".

--------------------------------------------------------------------

All ConTeXt sub-documents would be typeset in "once" mode, using the latest complete set of multi-pass data obtained from the central coordinator. Then, once each typesetting run is complete, the resulting multi-pass data would be sent back to the coordinator, to be used to update the coordinator's complete set, ready for any required next typesetting pass.

(From `context --help`:
    mtx-context | --once  only run once (no multipass data file is produced)

I will clearly have to patch(?) the `mtx-context.lua` script to allow multipass data to be produced... this is probably not a problem.)
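To make the coordinator's merge step concrete, here is a minimal sketch in plain Lua. It assumes (still to be verified!) that a `*.tuc` file is plain, loadable Lua which populates (or returns) a table called `utilitydata`; the `load_tuc`/`deep_merge`/`on_subdocument_done` names and the naive merge policy are purely hypothetical scaffolding, not anything taken from `mtx-context.lua`:

```lua
-- Hypothetical coordinator sketch (plain Lua 5.4, as used by LMTX).
-- ASSUMPTION: a *.tuc file is loadable Lua which populates a table
-- named `utilitydata` (or returns one); verify against a real file.

local function load_tuc(path)
  local env = setmetatable({}, { __index = _G })
  local chunk, err = loadfile(path, "t", env)   -- sandboxed load
  if not chunk then return nil, err end
  local ok, result = pcall(chunk)
  if not ok then return nil, result end
  return result or env.utilitydata
end

-- Naively overlay `incoming` onto `master`: tables merge key by key,
-- scalars from the most recent sub-document run win. A real coordinator
-- would need smarter conflict handling (page renumbering, etc.).
local function deep_merge(master, incoming)
  for k, v in pairs(incoming) do
    if type(v) == "table" and type(master[k]) == "table" then
      deep_merge(master[k], v)
    else
      master[k] = v
    end
  end
  return master
end

-- Coordinator state: one master copy of the multi-pass data, updated
-- each time a worker reports that a sub-document run has finished.
local master = {}

local function on_subdocument_done(tuc_path)
  local data, err = load_tuc(tuc_path)
  if data then
    deep_merge(master, data)
  else
    io.stderr:write("cannot load ", tuc_path, ": ", tostring(err), "\n")
  end
end
```

The merged `master` table would then be serialized and broadcast back to the waiting workers over NATS before the next typesetting pass.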
(There would also be a number of additional processes/containers for dependency analysis, build sequencing, compilation of code, execution or interpretation of the code, stitching the PDFs back into one PDF, etc. -- these processes are also not the really critical problem at the moment.)

--------------------------------------------------------------------

QUESTIONS:

1. Are there any other known attempts to parallelize ConTeXt?
2. Are there any other obvious problems with my approach?
3. Is there any existing documentation on the contents of the `*.tuc` file?
4. If there is no such documentation, is there any naming pattern of the Lua functions which get/set this multi-pass information that I should be aware of?

--------------------------------------------------------------------

Many thanks for all of the very useful comments so far...

Regards,

Stephen Gaito
On 3 Dec 2020, at 12:04, Stephen Gaito wrote:

> 1. Are there any other known attempts to parallelize ConTeXt?
Not that I know of, except for the tricks I mentioned in my earlier mail today.
> 2. Are there any other obvious problems with my approach?
The big problem with references is that changed / resolved references can change other (future) references, because the typeset length can be different: a following reference shifts to another page, which in turn can push another reference to yet another page, perhaps changing a page break, et cetera. That is why the meta manual needs five runs; otherwise a maximum of two runs would always be enough (assuming no outside processing like generating a bibliography or index is needed). So your `--once` approach may fail in some cases, sorry.

Actually, the meta manual really *needs* only four runs. The last run is the one that verifies that the `.tuc` file has not changed (that is why a ConTeXt document with no cross-references at all uses two runs, and is one of the reasons for the existence of the `--once` switch). Depending on your docs, you may be able to skip a run by using `--runs` yourself.

Best wishes,
Taco
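PS. In other words, the whole build is a fixed-point iteration on the `.tuc` data, so you can drive it yourself. A rough sketch in plain Lua; the `--runs=1` invocation (assuming it still writes the `.tuc`, unlike `--once`) and the byte-for-byte comparison are illustrative assumptions only:

```lua
-- Rough driver sketch: re-run context until the multi-pass data
-- reaches a fixed point, with a safety cap. ASSUMPTION: "--runs=1"
-- still writes the *.tuc file (unlike "--once"); the command line
-- and file names are illustrative only.

local function slurp(path)
  local f = io.open(path, "rb")
  if not f then return nil end
  local s = f:read("a")
  f:close()
  return s
end

local function typeset_until_stable(jobname, maxruns)
  maxruns = maxruns or 5
  for run = 1, maxruns do
    local before = slurp(jobname .. ".tuc")
    local ok = os.execute("context --runs=1 " .. jobname .. ".tex")
    if not ok then
      return false, "run " .. run .. " failed"
    end
    local after = slurp(jobname .. ".tuc")
    if after and after == before then
      return true, run   -- the .tuc no longer changes: references converged
    end
  end
  return false, "no fixed point after " .. maxruns .. " runs"
end

-- e.g. typeset_until_stable("mydocument")
```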
On 12/3/2020 12:04 PM, Stephen Gaito wrote:
> - very large (1,000+ pages),
not that large; literate code is often verbatim, so that doesn't take much runtime either
> - highly cross-referenced documents,
ok, that demands runs
> - with embedded literate-programmed code (which needs concurrent compiling and execution),
you only need to process those snippets when something has changed, and there are ways in context to deal with that (like \typesetbuffer and such, which only reprocesses when something changed between runs)
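The same change-detection idea is easy to mimic for the external compile and run steps. A minimal sketch in plain Lua (the cache-file scheme and names are illustrative only, not how context does it internally):

```lua
-- Sketch of the "only reprocess when something changed" idea: keep a
-- cached copy of each snippet from the previous run and only recompile
-- when the current content differs from that copy.

local function read_all(path)
  local f = io.open(path, "rb")
  if not f then return nil end
  local s = f:read("a")
  f:close()
  return s
end

local function write_all(path, s)
  local f = assert(io.open(path, "wb"))
  f:write(s)
  f:close()
end

-- Run `command` over `snippetfile` only when its content differs from
-- the cached copy left behind by the previous run.
local function process_if_changed(snippetfile, command)
  local current = read_all(snippetfile)
  local cached  = read_all(snippetfile .. ".cached")
  if current and current == cached then
    return false             -- unchanged: reuse the previous result
  end
  os.execute(command)         -- e.g. compile or interpret the snippet
  write_all(snippetfile .. ".cached", current or "")
  return true
end
```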
> - containing multiple MetaFun graphics,
those don't take time, assuming efficient metapost code

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------