Taco Hoekwater wrote:
> David Kastrup wrote:
>> I'll also try to think about where an input encoding implementation could go: the process_input_buffer callback has several drawbacks; for one ...
> You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.
Ok, I'll have to look at that avenue then.
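If I read the manual right, the skeleton would be something like this (an untested sketch; 'recode' is just a placeholder name for whatever per-line conversion gets plugged in):

    -- Untested sketch: input reencoding hung off the open_read_file callback.
    -- LuaTeX passes the callback a file name; we hand back a table whose
    -- reader() delivers one (already reencoded) line per call, nil at EOF.

    local function recode(line)
      return line            -- placeholder: the actual conversion goes here
    end

    callback.register("open_read_file", function(asked_name)
      local f = io.open(asked_name, "rb")
      if not f then
        return nil           -- no reader environment; presumably treated as unreadable
      end
      return {
        reader = function(env)
          local line = f:read("*l")
          if not line then
            return nil       -- end of file: no more lines from this source
          end
          return recode(line)
        end,
        close = function(env)
          f:close()
        end,
      }
    end)

That would at least keep the conversion tied to the file it came from, which is the property you are after.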
> If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.
Oh, I expected as much. "All cases" will, however, likely just cover utf-8, the various utf-16 flavours, and 8-bit input encodings (including transparent ones) without escape characters.

For encodings with escapes (pretty common in Asia, IIRC), I don't really see a compact, versatile solution that does not require a large external library (the XeTeX approach). One rather crazy idea for such encodings would be a CCL interpreter: CCL is a very small special-purpose bytecode interpreter (not to be confused with Elisp bytecode) used within Emacs to convert a multitude of encodings quickly between Emacs's internal representation (which is utf-8 in the emacs-unicode development branch) and files. Another idea would be, of course, to leave the details to Lua again and implement a subset that is reasonably easy to extend later.
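To make the "leave it to Lua" option concrete for the easy cases: an 8-bit encoding without escape sequences needs nothing more than a small per-line conversion function, for example for latin-1 (again only an untested sketch, and the function name is mine):

    -- Sketch: latin-1 -> utf-8 conversion of a single input line in plain Lua.
    -- Bytes below 0x80 pass through; bytes 0x80-0xFF (whose latin-1 meaning
    -- coincides with the Unicode code point) become two-byte utf-8 sequences.
    local function latin1_to_utf8(line)
      return (line:gsub("[\128-\255]", function(c)
        local b = string.byte(c)
        return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
      end))
    end

Other 8-bit encodings would need a 256-entry lookup table instead of the arithmetic, but the shape stays the same; the escape-based encodings are where it stops being this simple.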
> And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
> I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
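Just so we are talking about the same thing, I take the problematic cases to be of this sort (purely hypothetical, not taken from any real document; to be run inside \directlua or a Lua file read by luatex):

    -- Hypothetical stress case: hand TeX a single megabyte-long input line
    -- from Lua.  A \scantokens of a similarly huge macro body on the TeX side
    -- is the analogous case; both are what the buffer-limit discussion is about.
    tex.print(string.rep("x", 2^20))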
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.

I might be able to buffer the token list for \scantokens and keep it around as a character source while the generated input is being processed, but I have to look at the details of its machinations. tex.print() is a different case, but it might be possible to pass the Lua string straight into TeX, use it as the input source, and free it once processing completes. That way it would occupy the Lua string pool (which is better suited to this sort of abuse) while being read, without letting the TeX input buffer explode all at once. If you'll accept a working, complete solution, we'll have a deal.

There are constructs for which I have to keep a hard limit: \csname ... \endcsname and, probably more importantly, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit into the buffer completely. The same goes for \somecontrolsequencename: it must also fit. Except for \ifcsname, those constructs also permanently impact the hash table size, so they are probably not frequent.

-- David Kastrup