Hi,

Since I am off for Easter, I just want to spread the implementation idea for dealing with arbitrarily long lines, so that I can get some feedback before I get to actual coding. Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.

I'll also try thinking about a place for an input encoding implementation: the process_input_buffer callback has several drawbacks. For one thing, it can't expect to get complete lines if I prepare lines piecemeal, so it would probably have to get another argument "partial". For another, the line end needs to be detected in the first place, and line detection _depends_ on the encoding, at least if we are talking about utf-16 flavors. And in particular, if process_input_buffer gets partial lines, I'd like it to get lines that don't contain partial utf-8 sequences. So I do see a need for some more stuff specific to LuaTeX, and I don't want to design something that would be hard to port to it.

So much for the background. The basic design would be the following:

The buffer remains a single buffer. It could probably be given a total size of 32k (naturally, people will disagree here, but that will stay configurable). Before reading material from a file, TeX will _start_ by placing the \endlinechar before any other material, as a fixed 4-byte utf-8 sequence (alternatively, it may be stored as part of the file data structure): its setting at the time of reading the file needs to be preserved until we finally reach the end of the file.

Then the next line gets read into the buffer, either until the end of the line is reached or until the buffer read limit is hit (2k sounds reasonable; it could conceivably be made configurable, since it influences things like the maximum size of \csname ...\endcsname).

When processing material, we usually check for the end-of-line condition anyway. When such a check turns out true, we do another check for "really" end of line versus just the end of the buffered part. If it is just the end of the buffered part, then sufficient material from the end of the buffered part is copied to the front of the buffer, more stuff is read in according to the buffer read limit (possibly tacking the buffered end-line character on at the end), and we resume. "Sufficient material" means the maximum of

a) 40 characters (probably 160 bytes will do) of error context for the input-line context part of error messages;

b) if we are in the middle of scanning a control sequence name, the beginning of the control sequence.

If this copying process would not result in any more available space (making it possible to actually read in new material), we get the dreaded buffer overflow.

Basically, this concept appears sound to me. It would, however, be strictly restricted to file reading. Things like \scantokens and \csname (which also use the buffer) would still require their argument to fit in one piece. But I guess that file reading covers the largest problem area.

Something like that. I hope I'll have something to show before EuroTeX. But as I said, I am away without net access (and without a computer) for the next week.

All the best,
David

--
David Kastrup
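A minimal Lua model of the refill step described above, to make the "copy sufficient material to the front, then read more" idea concrete. All names, sizes, and the file interface here are illustrative assumptions; the real change would live in the engine's input code, not in Lua.

  -- Model of the buffer-window refill described above (illustrative only).
  local BUF_SIZE   = 32 * 1024   -- total buffer size (configurable)
  local READ_LIMIT = 2 * 1024    -- bytes read per refill
  local CONTEXT    = 160         -- bytes kept for error context (~40 characters)

  -- state.buf is the current window, state.pos the scan position,
  -- state.cs_start the start of a partially scanned \csname (or nil).
  local function refill(state, file)
    -- keep at least the error context, and never cut off the beginning
    -- of a control sequence name that is still being scanned
    local keep_from = math.min(state.pos - CONTEXT, state.cs_start or state.pos)
    if keep_from < 1 then keep_from = 1 end

    local kept = string.sub(state.buf, keep_from)
    if #kept + READ_LIMIT > BUF_SIZE then
      error("buffer overflow: keeping the required context leaves no room")
    end

    -- real code would stop at a line ending and at a utf-8 boundary;
    -- here we just read a raw chunk
    local chunk = file:read(READ_LIMIT)
    state.buf = kept .. (chunk or "")
    state.pos = state.pos - keep_from + 1
    if state.cs_start then
      state.cs_start = state.cs_start - keep_from + 1
    end
    return chunk ~= nil    -- false: nothing left, this really was the end
  end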
David Kastrup wrote:
I'll also try thinking about some input encoding implementation place: the process_input_buffer callback has several drawbacks: for one
You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.

If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.

And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all. I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().

Best wishes,
Taco
Taco Hoekwater wrote:
David Kastrup wrote:
I'll also try thinking about some input encoding implementation place: the process_input_buffer callback has several drawbacks: for one
You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.
Ok, I'll have to look at that avenue then.
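For what it's worth, a rough sketch of what reencoding through 'open_read_file' could look like, assuming the callback shape documented for LuaTeX (return a table whose reader function yields one line per call, nil at end of file, plus an optional close function). The latin-1 example and the use of slnunicode's unicode.utf8.char are my own illustration, not anything proposed in this thread.

  -- Recode latin-1 files to utf-8 before TeX sees them.
  callback.register("open_read_file", function(name)
    local f = assert(io.open(name, "rb"))
    return {
      reader = function(env)
        local line = f:read("*l")
        if not line then return nil end
        -- latin-1 bytes 0x80-0xFF map directly to the same code points
        return (line:gsub("[\128-\255]", function(c)
          return unicode.utf8.char(string.byte(c))
        end))
      end,
      close = function(env) f:close() end,
    }
  end)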
If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.
Oh, I expected as much. "All cases" will, however, likely just cover utf-8, the various utf-16 flavors, and 8-bit input encodings (including transparent) without escape characters. For encodings with escapes (pretty common, IIRC, in Asia), I don't really see a compact/versatile solution that does not necessitate a large external library (the XeTeX approach).

One rather crazy idea for such encodings would be a CCL interpreter: CCL is a very small special-purpose bytecode interpreter (not to be confused with Elisp bytecode) used within Emacs to convert a multitude of encodings fast between Emacs' internal representation (which is utf-8 in the emacs-unicode branch of development) and files.

Another idea would be, of course, leaving the details to Lua again and implementing a subset that makes it reasonably easy to extend later.
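As a toy illustration of the "leave the details to Lua" option: a registry of byte-to-code-point tables would already cover stateless 8-bit encodings and could be extended later. Everything below is made up for the example except unicode.utf8.char, which is the slnunicode function shipped with LuaTeX; escape-based encodings would need a stateful decoder and are not covered.

  local enc = { tables = {} }

  -- map: table from byte value (0x80-0xFF) to Unicode code point
  function enc.register(name, map)
    enc.tables[name] = map
  end

  function enc.decode(name, line)
    local map = assert(enc.tables[name], "unknown encoding: " .. name)
    return (line:gsub("[\128-\255]", function(c)
      local cp = map[string.byte(c)] or 0xFFFD   -- unmapped byte -> U+FFFD
      return unicode.utf8.char(cp)
    end))
  end

  -- example: register a partial, made-up mapping and use it
  enc.register("demo8bit", { [0xA4] = 0x20AC })  -- say, byte 0xA4 -> EURO SIGN
  -- print(enc.decode("demo8bit", "price: \164 10"))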
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.

I might be able to buffer the token list for \scantokens and keep it around as a character source while the generated input is getting processed, but I have to look at the details of its machinations.

tex.print() is a different case, but it might be possible to just pass the Lua string variable into TeX, use it as an input source, and free it once processing completes. That way it would occupy the Lua string pool (which is better suited to deal with this sort of abuse) while being read, but not let the TeX input buffer explode all at once.

If you'll accept a working complete solution, we'll have a deal.

There is one construct for which I have to keep a hard limit: \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit. Except for \ifcsname, those constructs also permanently impact hash table size, so they are probably not frequent.

--
David Kastrup
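A small Lua model of the tex.print() idea above: keep the big string on the Lua side and hand out bounded slices, so a fixed-size input buffer only ever holds one slice at a time. Purely illustrative; the actual work would happen inside the engine, and a real version would also avoid splitting utf-8 sequences at slice borders.

  local function chunked_source(s, limit)
    limit = limit or 2048              -- matches the 2k read limit discussed above
    local pos = 1
    return function()
      if pos > #s then return nil end  -- exhausted: the string may now be freed
      local chunk = string.sub(s, pos, pos + limit - 1)
      pos = pos + limit
      return chunk
    end
  end

  -- usage: feed a multi-megabyte string piece by piece
  local next_chunk = chunked_source(string.rep("x", 5 * 1024 * 1024))
  for piece in next_chunk do
    -- consume 'piece' (at most 2048 bytes per iteration)
  end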
Hi again,

David Kastrup wrote:
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.
Yes, that is definitely possible (and even likely).
There is one construct for which I have to keep a hard limit. \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit.
I agree. Control sequence names longer than, say, 50 characters are unwieldy in practice anyway, and defining csnames for the sake of hashing can better be done using Lua strings. If you arrange that a \csname has to fit inside one of your 1K windows, that should be fine.

Best,
Taco
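What "hashing via Lua strings" can look like in practice, for what it's worth: instead of defining \csname seen/<long key>\endcsname just to be able to test for its existence later, keep a plain Lua table keyed by the (arbitrarily long) string. The function names are invented for the example; the TeX-side call in the comment assumes the \directlua and \luaescapestring primitives.

  seen_keys = seen_keys or {}

  function mark_seen(key)
    seen_keys[key] = true
  end

  function is_seen(key)
    return seen_keys[key] == true
  end

  -- From the TeX end, something like
  --   \directlua{ mark_seen("\luaescapestring{#1}") }
  -- never puts the long key into TeX's hash table or input buffer.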
Taco Hoekwater wrote:
Hi again,
David Kastrup wrote:
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.
Yes, that is definitely possible (and even likely).
There is one construct for which I have to keep a hard limit. \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit.
I agree. Control sequence names longer than, say, 50 characters are unwieldy in practice anyway, and defining csnames for the sake of hashing can better be done using Lua strings. If you arrange that a \csname has to fit inside one of your 1K windows, that should be fine.
suffix.sty does something like

  ... \futurelet\a\b}
  \def\b{\ifcsname xxx@\meaning\a\endcsname ...}

and expects to see things like "the character *" or so in \meaning. If instead the meaning is a macro containing a few thousand characters, TeX will panic.

Since \WithSuffix is used for optional arguments, and those are supposed to be followed by "an opening brace {" or similar easily described things, this is not likely to cause trouble as long as one keeps braces around one's argument. So this is an application which could cause TeX to bomb out on certain input: it is rather bold about what it feeds to \ifcsname, relying on the guarantee that this will not impact the hash space.

suffix.sty is my own package. I know of no other package doing similarly reckless things, and I think it unlikely that people will manage to trigger the problematic cases. Just wanted to mention it. And one will always get by with increasing the buffer size.

--
David Kastrup
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking that it would be some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part); so it should work for pdftex too

there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
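The "preprocess input lines" hook Hans refers to is the process_input_buffer callback, which receives each input line as a string and may return a replacement; overloading reading from a file goes through open_read_file, as sketched earlier in the thread. A minimal placeholder, just to show the shape:

  callback.register("process_input_buffer", function(line)
    -- recode or otherwise rewrite the line here
    return line
  end)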
Hans Hagen wrote:
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking that it would be some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part); so it should work for pdftex too
\endlinechar can take up a variable amount of space, and one would not want to allow the buffer border to occur in the middle of a utf-8 _character_: that would necessitate allowing an "end of buffer" condition to occur in far too many places. Also, the TeX error context (which currently is something like 40 _bytes_, not characters) should include a sufficient number of characters.

Doing this for LuaTeX and porting later to PDFTeX should work fine. Doing it the other way round might get more awkward than thinking about utf-8 right from the start.
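For the utf-8 boundary issue, the check itself is cheap: given a byte string and a tentative cut position, step back while the byte is a continuation byte (0x80-0xBF), so that the border always falls on the start of a character. A small sketch of that idea, with an invented function name:

  local function safe_cut(s, cut)
    while cut > 1 do
      local b = string.byte(s, cut)
      if b < 0x80 or b >= 0xC0 then break end  -- ASCII or lead byte: fine
      cut = cut - 1                            -- continuation byte: back up
    end
    return cut
  end

  -- assert(safe_cut("Grüße", 4) == 3)  -- don't split the two-byte 'ü'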
there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines
Hans, "no need to change" is a fine song. But I am trying my best to work on making TeX/LuaTeX more robust for large-scale deployment, because that is where the interest of my employer lies. I don't think that conflicts with your goals.

If I get my employer to agree, I'll also try investing time in cleaning up after the Aleph/Omega mess, by analyzing the code, its effects on existing documents and on TeX's efficiency and output, and discussing what makes sense to keep, what to reimplement, and what to fix. I realize that the code is there to stay, but there is actually nobody available who is familiar with it, and its quality is sub-par because John Plaice intended all of the quick hacks it consists of to be replaced with C++ code in Omega 2, and so invested no time in cleaning up or in sensible documentation. Giuseppe did fix some things, but also did not really delve much into the code, and is no longer available. Of course, one fixed goal is not to remove or break anything that Idris needs. That is understood.

Anyway, Lua is not an efficient tool for manipulating megabyte-size strings. Strings get allocated in its string pool and stay at _fixed_ addresses until they are freed; there is no compacting garbage collection. For that reason, Lua programmers work with string ropes, which are rather less pleasant for manipulation. The string manipulations that Lua _does_ offer are not convenient or efficient for input encoding transformations, which work at a very fine granularity and are partly context-dependent. Lua may be better than TeX itself, but anything is better than TeX.

I am not asking you to invest any time on this. I do the stuff on my own. But I would ask for a fair evaluation once I have a working solution. And I fail to see why this evaluation should be harmed if I consider not just the needs of 8-bit PDFTeX, but rather the specific problems around LuaTeX's utf-8.

As a note aside: even systems that use utf-8 as their internal encoding usually require an input and output translation (Emacs 23, for example) in order to convert possibly illegal utf-8 sequences into a canonical, reproducible, legal utf-8 representation and back again. slnunicode is not prepared to do such verification/normalization, and we would not want stability to be impacted in cases where illicit utf-8 gets input. Having an encoding frontend that assures us nothing but well-formed utf-8 will ever enter LuaTeX (even when the input is random garbage) is an advantage in my book.

Again, I am not asking you to invest any work on this. And it is nothing that I'll be submitting as part of my buffer size patch. It is merely something that I choose to keep in mind.

--
David Kastrup
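As an aside, the kind of input sanitizing described in the last paragraph can be sketched in a few lines: replace anything that is not structurally well-formed utf-8 by U+FFFD, so the engine only ever sees valid sequences. This is a simplified illustration (it checks lead/continuation byte structure but not overlong or surrogate encodings), and all names are mine.

  local REPLACEMENT = "\239\191\189"   -- U+FFFD in utf-8

  local function sanitize(s)
    local out, i, n = {}, 1, #s
    while i <= n do
      local b = string.byte(s, i)
      local len = (b < 0x80 and 1)
               or (b >= 0xC2 and b <= 0xDF and 2)
               or (b >= 0xE0 and b <= 0xEF and 3)
               or (b >= 0xF0 and b <= 0xF4 and 4)
               or 0
      local ok = len > 0 and i + len - 1 <= n
      if ok then
        for j = i + 1, i + len - 1 do          -- all trail bytes 0x80-0xBF?
          local t = string.byte(s, j)
          if t < 0x80 or t > 0xBF then ok = false end
        end
      end
      if ok then
        out[#out + 1] = string.sub(s, i, i + len - 1)
        i = i + len
      else
        out[#out + 1] = REPLACEMENT            -- drop one bad byte, resync
        i = i + 1
      end
    end
    return table.concat(out)
  end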
participants (3)
- David Kastrup
- Hans Hagen
- Taco Hoekwater