Hi,

Since I am off for Easter, I just want to spread the implementation idea for dealing with arbitrarily long lines, so that I can get some feedback before I get to actual coding. Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.

I'll also try thinking about a place for an input encoding implementation: the process_input_buffer callback has several drawbacks. For one thing, it can't expect to get complete lines if I prepare lines piecemeal, so it would probably have to get another argument "partial". For another, the line end needs to be detected in the first place, and line detection _depends_ on the encoding, at least if we are talking about utf-16 flavors. And in particular, if process_input_buffer gets partial lines, I'd like it to get lines that don't contain partial utf-8 sequences. So I do see a need for some more stuff specific to LuaTeX, and I don't want to design something that would be hard to port to it.

So much for the background. The basic design would be the following:

The buffer remains a single buffer. It could probably be given a total size of 32k (naturally, people will disagree here, but that will stay configurable). Before reading material from a file, TeX will _start_ by placing the \endlinechar before any other material, as a fixed 4-byte utf-8 sequence (alternatively, it may be stored as part of the file data structure): its setting at the time of reading the file needs to be preserved until we finally reach the end of the file.

Then the next line gets read into the buffer, either until the end of the line is reached or until the buffer read limit is hit (2k sounds reasonable; it could conceivably be made configurable, since it influences things like the maximum size of \csname ...\endcsname).

When processing material, we usually check for the end-of-line condition anyway. When such a check turns out true, we do another check for "really" end of line versus just the end of the buffered part. If it is just the end of the buffered part, then sufficient material from the end of the buffered part is copied to the front of the buffer, more stuff is read in according to the buffer read limit (possibly tacking the buffered end-line character on at the end), and we resume. "Sufficient material" means the maximum of

a) 40 characters (probably 160 bytes will do) of error context for the input-line context part of error messages;

b) if we are in the middle of scanning a control sequence name, the beginning of the control sequence.

If this copying process would not result in any more available space (making it possible to actually read in new material), we get the dreaded buffer overflow.

Basically, this concept appears sound to me. It would, however, be strictly restricted to file reading. Things like \scantokens and \csname (which also use the buffer) would still require their argument to fit in one piece. But I guess that file reading covers the largest problem area.

Something like that. I hope I'll have something to show before EuroTeX. But as I said, I am away without net access (and without a computer) for the next week.

All the best,
David

--
David Kastrup
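A minimal Lua model of the refill step described above, to make the "copy sufficient material to the front, then read more" idea concrete. All names, sizes, and the file interface here are illustrative assumptions; the real change would live in the engine's input code, not in Lua.

  -- Model of the buffer-window refill described above (illustrative only).
  local BUF_SIZE   = 32 * 1024   -- total buffer size (configurable)
  local READ_LIMIT = 2 * 1024    -- bytes read per refill
  local CONTEXT    = 160         -- bytes kept for error context (~40 characters)

  -- state.buf is the current window, state.pos the scan position,
  -- state.cs_start the start of a partially scanned \csname (or nil).
  local function refill(state, file)
    -- keep at least the error context, and never cut off the beginning
    -- of a control sequence name that is still being scanned
    local keep_from = math.min(state.pos - CONTEXT, state.cs_start or state.pos)
    if keep_from < 1 then keep_from = 1 end

    local kept = string.sub(state.buf, keep_from)
    if #kept + READ_LIMIT > BUF_SIZE then
      error("buffer overflow: keeping the required context leaves no room")
    end

    -- real code would stop at a line ending and at a utf-8 boundary;
    -- here we just read a raw chunk
    local chunk = file:read(READ_LIMIT)
    state.buf = kept .. (chunk or "")
    state.pos = state.pos - keep_from + 1
    if state.cs_start then
      state.cs_start = state.cs_start - keep_from + 1
    end
    return chunk ~= nil    -- false: nothing left, this really was the end
  end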
David Kastrup wrote:
I'll also try thinking about some input encoding implementation place: the process_input_buffer callback has several drawbacks: for one
You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.

If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.

And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all. I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().

Best wishes,
Taco
Taco Hoekwater wrote:
David Kastrup wrote:
I'll also try thinking about some input encoding implementation place: the process_input_buffer callback has several drawbacks: for one
You should not use that callback for input reencoding anyway (I thought you just did that because you wanted a quick hack). The idea is to use the 'open_read_file' callback instead, because that callback remains tied to the actual file.
Ok, I'll have to look at that avenue then.
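For what it's worth, a rough sketch of what reencoding through 'open_read_file' could look like, assuming the callback shape documented for LuaTeX (return a table whose reader function yields one line per call, nil at end of file, plus an optional close function). The latin-1 example and the use of slnunicode's unicode.utf8.char are my own illustration, not anything proposed in this thread.

  -- Recode latin-1 files to utf-8 before TeX sees them.
  callback.register("open_read_file", function(name)
    local f = assert(io.open(name, "rb"))
    return {
      reader = function(env)
        local line = f:read("*l")
        if not line then return nil end
        -- latin-1 bytes 0x80-0xFF map directly to the same code points
        return (line:gsub("[\128-\255]", function(c)
          return unicode.utf8.char(string.byte(c))
        end))
      end,
      close = function(env) f:close() end,
    }
  end)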
If you come up with a different system altogether, that is fine. But I should warn you: I will not even look at partial solutions. Either you fix reencoding for all cases, or not at all.
Oh, I expected as much. "All cases" will, however, likely just cover utf-8, the various utf-16 flavors, and 8-bit input encodings (including transparent) without escape characters. For encodings with escapes (pretty common, IIRC, in Asia), I don't really see a compact/versatile solution that does not necessitate a large external library (the XeTeX approach).

One rather crazy idea for such encodings would be a CCL interpreter: CCL is a very small special-purpose bytecode interpreter (not to be confused with Elisp bytecode) used within Emacs to convert a multitude of encodings fast between Emacs' internal representation (which is utf-8 in the emacs-unicode branch of development) and files.

Another idea would be, of course, leaving the details to Lua again and implementing a subset that makes it reasonably easy to extend later.
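As a toy illustration of the "leave the details to Lua" option: a registry of byte-to-code-point tables would already cover stateless 8-bit encodings and could be extended later. Everything below is made up for the example except unicode.utf8.char, which is the slnunicode function shipped with LuaTeX; escape-based encodings would need a stateful decoder and are not covered.

  local enc = { tables = {} }

  -- map: table from byte value (0x80-0xFF) to Unicode code point
  function enc.register(name, map)
    enc.tables[name] = map
  end

  function enc.decode(name, line)
    local map = assert(enc.tables[name], "unknown encoding: " .. name)
    return (line:gsub("[\128-\255]", function(c)
      local cp = map[string.byte(c)] or 0xFFFD   -- unmapped byte -> U+FFFD
      return unicode.utf8.char(cp)
    end))
  end

  -- example: register a partial, made-up mapping and use it
  enc.register("demo8bit", { [0xA4] = 0x20AC })  -- say, byte 0xA4 -> EURO SIGN
  -- print(enc.decode("demo8bit", "price: \164 10"))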
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.

I might be able to buffer the token list for \scantokens and keep it around as a character source while the generated input is getting processed, but I have to look at the details of its machinations.

tex.print() is a different case, but it might be possible to just pass the Lua string variable into TeX, use it as an input source, and free it once processing completes. That way it would occupy the Lua string pool (which is better suited to deal with this sort of abuse) while being read, but not let the TeX input buffer explode all at once.

If you'll accept a working complete solution, we'll have a deal.

There is one construct for which I have to keep a hard limit: \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit. Except for \ifcsname, those constructs also permanently impact hash table size, so they are probably not frequent.

--
David Kastrup
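A small Lua model of the tex.print() idea above: keep the big string on the Lua side and hand out bounded slices, so a fixed-size input buffer only ever holds one slice at a time. Purely illustrative; the actual work would happen inside the engine, and a real version would also avoid splitting utf-8 sequences at slice borders.

  local function chunked_source(s, limit)
    limit = limit or 2048              -- matches the 2k read limit discussed above
    local pos = 1
    return function()
      if pos > #s then return nil end  -- exhausted: the string may now be freed
      local chunk = string.sub(s, pos, pos + limit - 1)
      pos = pos + limit
      return chunk
    end
  end

  -- usage: feed a multi-megabyte string piece by piece
  local next_chunk = chunked_source(string.rep("x", 5 * 1024 * 1024))
  for piece in next_chunk do
    -- consume 'piece' (at most 2048 bytes per iteration)
  end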
Hi again,

David Kastrup wrote:
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.
Yes, that is definitely possible (and even likely).
There is one construct for which I have to keep a hard limit. \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit.
I agree. Control sequence names longer than, say, 50 characters are unwieldy in practice anyway, and defining csnames for the sake of hashing can better be done using Lua strings. If you arrange that a \csname has to fit inside one of your 1K windows, that should be fine.

Best,
Taco
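What "hashing via Lua strings" can look like in practice, for what it's worth: instead of defining \csname seen/<long key>\endcsname just to be able to test for its existence later, keep a plain Lua table keyed by the (arbitrarily long) string. The function names are invented for the example; the TeX-side call in the comment assumes the \directlua and \luaescapestring primitives.

  seen_keys = seen_keys or {}

  function mark_seen(key)
    seen_keys[key] = true
  end

  function is_seen(key)
    return seen_keys[key] == true
  end

  -- From the TeX end, something like
  --   \directlua{ mark_seen("\luaescapestring{#1}") }
  -- never puts the long key into TeX's hash table or input buffer.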
Taco Hoekwater wrote:
Hi again,
David Kastrup wrote:
And the same is true for buffer size limit problems. You have to figure out a way to deal with \scantokens, or you may as well forget about writing any code at all.
I see absolutely no point in solving buffer overflows for files when we keep getting unrecoverable errors for \scantokens and luatex's tex.print().
Do you actually expect megabyte lines from \scantokens and tex.print()? That question is serious and will affect the design.
Yes, that is definitely possible (and even likely).
There is one construct for which I have to keep a hard limit. \csname ... \endcsname and, probably more important, \ifcsname ... \endcsname. Reducing the total input buffer size to something more reasonable will _definitely_ affect them, since they must fit the buffer completely. The same goes for \somecontrolsequencename: this must also fit.
I agree. Control sequence names longer than, say, 50 characters are unwieldy in practice anyway, and defining csnames for the sake of hashing can better be done using Lua strings. If you arrange that a \csname has to fit inside one of your 1K windows, that should be fine.
suffix.sty does something like

  ... \futurelet\a\b}
  \def\b{\ifcsname xxx@\meaning\a\endcsname ...}

and expects to see things like "the character *" or so in \meaning. If instead the meaning is a macro containing a few thousand characters, TeX will panic.

Since \WithSuffix is used for optional arguments, and those are supposed to be followed by "an opening brace {" or similar easily described things, this is not likely to cause trouble as long as one keeps braces around one's argument. So this is an application which could cause TeX to bomb out on certain input: it is rather bold about what it feeds to \ifcsname, relying on the guarantee that this will not impact the hash space.

suffix.sty is my own package. I know of no other package doing similarly reckless things, and I think it unlikely that people will manage to trigger the problematic cases. Just wanted to mention it. And one will always get by with increasing the buffer size.

--
David Kastrup
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking that it would be some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part); so it should work for pdftex too

there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
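The "preprocess input lines" hook Hans refers to is the process_input_buffer callback, which receives each input line as a string and may return a replacement; overloading reading from a file goes through open_read_file, as sketched earlier in the thread. A minimal placeholder, just to show the shape:

  callback.register("process_input_buffer", function(line)
    -- recode or otherwise rewrite the line here
    return line
  end)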
Hans Hagen wrote:
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf-8, I actually would want to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking that it would be some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part); so it should work for pdftex too
\endlinechar can take up a variable amount of space, and one would not want to allow the buffer border to occur in the middle of a utf-8 _character_: that would necessitate allowing an "end of buffer" condition to occur in far too many places. Also, the TeX error context (which currently is something like 40 _bytes_, not characters) should include a sufficient number of characters.

Doing this for LuaTeX and porting later to PDFTeX should work fine. Doing it the other way round might get more awkward than thinking about utf-8 right from the start.
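For the utf-8 boundary issue, the check itself is cheap: given a byte string and a tentative cut position, step back while the byte is a continuation byte (0x80-0xBF), so that the border always falls on the start of a character. A small sketch of that idea, with an invented function name:

  local function safe_cut(s, cut)
    while cut > 1 do
      local b = string.byte(s, cut)
      if b < 0x80 or b >= 0xC0 then break end  -- ASCII or lead byte: fine
      cut = cut - 1                            -- continuation byte: back up
    end
    return cut
  end

  -- assert(safe_cut("Grüße", 4) == 3)  -- don't split the two-byte 'ü'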
there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines
Hans, "no need to change" is a fine song. But I am trying my best to work on making TeX/LuaTeX more robust for large-scale deployment, because that is where the interest of my employer lies. I don't think that conflicts with your goals.

If I get my employer to agree, I'll also try investing time in cleaning up after the Aleph/Omega mess, by analyzing the code, its effects on existing documents and on TeX's efficiency and output, and discussing what makes sense to keep, what to reimplement, and what to fix. I realize that the code is there to stay, but there is actually nobody available who is familiar with it, and its quality is sub-par because John Plaice intended all of the quick hacks it consists of to be replaced with C++ code in Omega 2, and so invested no time in cleaning up or in sensible documentation. Giuseppe did fix some things, but also did not really delve much into the code, and is no longer available. Of course, one fixed goal is not to remove or break anything that Idris needs. That is understood.

Anyway, Lua is not an efficient tool for manipulating megabyte-size strings. Strings get allocated in its string pool and stay at _fixed_ addresses until they are freed; there is no compacting garbage collection. For that reason, Lua programmers work with string ropes, which are rather less pleasant for manipulation. The string manipulations that Lua _does_ offer are not convenient or efficient for input encoding transformations, which work at a very fine granularity and are partly context-dependent. Lua may be better than TeX itself, but anything is better than TeX.

I am not asking you to invest any time on this. I do the stuff on my own. But I would ask for a fair evaluation once I have a working solution. And I fail to see why this evaluation should be harmed if I consider not just the needs of 8-bit PDFTeX, but rather the specific problems around LuaTeX's utf-8.

As a note aside: even systems that use utf-8 as their internal encoding usually require an input and output translation (Emacs 23, for example) in order to convert possibly illegal utf-8 sequences into a canonical, reproducible, legal utf-8 representation and back again. slnunicode is not prepared to do such verification/normalization, and we would not want stability to be impacted in cases where illicit utf-8 gets input. Having an encoding frontend that assures us nothing but well-formed utf-8 will ever enter LuaTeX (even when the input is random garbage) is an advantage in my book.

Again, I am not asking you to invest any work on this. And it is nothing that I'll be submitting as part of my buffer size patch. It is merely something that I choose to keep in mind.

--
David Kastrup
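As an aside, the kind of input sanitizing described in the last paragraph can be sketched in a few lines: replace anything that is not structurally well-formed utf-8 by U+FFFD, so the engine only ever sees valid sequences. This is a simplified illustration (it checks lead/continuation byte structure but not overlong or surrogate encodings), and all names are mine.

  local REPLACEMENT = "\239\191\189"   -- U+FFFD in utf-8

  local function sanitize(s)
    local out, i, n = {}, 1, #s
    while i <= n do
      local b = string.byte(s, i)
      local len = (b < 0x80 and 1)
               or (b >= 0xC2 and b <= 0xDF and 2)
               or (b >= 0xE0 and b <= 0xEF and 3)
               or (b >= 0xF0 and b <= 0xF4 and 4)
               or 0
      local ok = len > 0 and i + len - 1 <= n
      if ok then
        for j = i + 1, i + len - 1 do          -- all trail bytes 0x80-0xBF?
          local t = string.byte(s, j)
          if t < 0x80 or t > 0xBF then ok = false end
        end
      end
      if ok then
        out[#out + 1] = string.sub(s, i, i + len - 1)
        i = i + len
      else
        out[#out + 1] = REPLACEMENT            -- drop one bad byte, resync
        i = i + 1
      end
    end
    return table.concat(out)
  end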
participants (3)
- David Kastrup
- Hans Hagen
- Taco Hoekwater