Taco Hoekwater wrote:
> David Kastrup wrote:
>> I'll also try to think about where an input encoding implementation could go: the process_input_buffer callback has several drawbacks; for one ...
> You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.
Ok, I'll have to look at that avenue then.
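If I read the manual right, the skeleton would be something like this (an untested sketch; 'recode' is just a placeholder name for whatever per-line conversion gets plugged in):

    -- Untested sketch: input reencoding hung off the open_read_file callback.
    -- LuaTeX passes the callback a file name; we hand back a table whose
    -- reader() delivers one (already reencoded) line per call, nil at EOF.

    local function recode(line)
      return line            -- placeholder: the actual conversion goes here
    end

    callback.register("open_read_file", function(asked_name)
      local f = io.open(asked_name, "rb")
      if not f then
        return nil           -- no reader environment; presumably treated as unreadable
      end
      return {
        reader = function(env)
          local line = f:read("*l")
          if not line then
            return nil       -- end of file: no more lines from this source
          end
          return recode(line)
        end,
        close = function(env)
          f:close()
        end,
      }
    end)

That would at least keep the conversion tied to the file it came from, which is the property you are after.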
> If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.
Oh, I expected as much. "All cases" will, however, likely just cover utf-8, the various utf-16 flavours, and 8-bit input encodings (including transparent ones) without escape characters.

For encodings with escapes (pretty common in Asia, IIRC), I don't really see a compact, versatile solution that does not require a large external library (the XeTeX approach). One rather crazy idea for such encodings would be a CCL interpreter: CCL is a very small special-purpose bytecode interpreter (not to be confused with Elisp bytecode) used within Emacs to convert a multitude of encodings quickly between Emacs's internal representation (which is utf-8 in the emacs-unicode development branch) and files. Another idea would be, of course, to leave the details to Lua again and implement a subset that is reasonably easy to extend later.
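To make the "leave it to Lua" option concrete for the easy cases: an 8-bit encoding without escape sequences needs nothing more than a small per-line conversion function, for example for latin-1 (again only an untested sketch, and the function name is mine):

    -- Sketch: latin-1 -> utf-8 conversion of a single input line in plain Lua.
    -- Bytes below 0x80 pass through; bytes 0x80-0xFF (whose latin-1 meaning
    -- coincides with the Unicode code point) become two-byte utf-8 sequences.
    local function latin1_to_utf8(line)
      return (line:gsub("[\128-\255]", function(c)
        local b = string.byte(c)
        return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
      end))
    end

Other 8-bit encodings would need a 256-entry lookup table instead of the arithmetic, but the shape stays the same; the escape-based encodings are where it stops being this simple.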
> And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
> I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
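Just so we are talking about the same thing, I take the problematic cases to be of this sort (purely hypothetical, not taken from any real document; to be run inside \directlua or a Lua file read by luatex):

    -- Hypothetical stress case: hand TeX a single megabyte-long input line
    -- from Lua.  A \scantokens of a similarly huge macro body on the TeX side
    -- is the analogous case; both are what the buffer-limit discussion is about.
    tex.print(string.rep("x", 2^20))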
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.

I might be able to buffer the token list for \scantokens and keep it around as a character source while the generated input is being processed, but I have to look at the details of its machinations. tex.print() is a different case, but it might be possible to pass the Lua string straight into TeX, use it as the input source, and free it once processing completes. That way it would occupy the Lua string pool (which is better suited to this sort of abuse) while being read, without letting the TeX input buffer explode all at once. If you'll accept a working, complete solution, we'll have a deal.

There are constructs for which I have to keep a hard limit: \csname ... \endcsname and, probably more importantly, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit into the buffer completely. The same goes for \somecontrolsequencename: it must also fit. Except for \ifcsname, those constructs also permanently impact the hash table size, so they are probably not frequent.

-- David Kastrup