Hans Hagen
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf8, I would actually prefer to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking of some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part), so it should work for pdftex too
\endlinechar can take up a variable amount of space, and one would not want the buffer border to fall in the middle of a utf-8 _character_: that would necessitate allowing an "end of buffer" condition in far too many places. Also, the TeX error context (which is currently something like 40 _bytes_, not characters) should include a sufficient number of characters. Doing this for LuaTeX and porting it to PDFTeX later should work fine. Doing it the other way round might get more awkward than thinking about utf-8 right from the start.
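A minimal sketch of that boundary rule, written in Lua only for illustration (the actual buffer code would of course live at the C level of the engine, and the function name here is made up): given a freshly read chunk, cut it back so it ends on a complete utf-8 character, and carry the remaining bytes over to the next refill.

  -- Return the largest n such that chunk:sub(1, n) ends on a utf-8
  -- character boundary, so a buffer refill never splits a multi-byte
  -- character across the border. Bytes past n go to the next refill.
  local function utf8_safe_length(chunk)
    local n = #chunk
    local i = n
    -- walk back over continuation bytes (10xxxxxx, i.e. 0x80..0xBF)
    while i > 0 do
      local b = chunk:byte(i)
      if b < 0x80 or b >= 0xC0 then break end
      i = i - 1
    end
    if i == 0 then return 0 end        -- chunk is all continuation bytes
    local lead = chunk:byte(i)
    local need = 1                     -- length the lead byte announces
    if     lead >= 0xF0 then need = 4
    elseif lead >= 0xE0 then need = 3
    elseif lead >= 0xC0 then need = 2 end
    if n - i + 1 < need then
      return i - 1                     -- last character incomplete: cut before it
    end
    return n                           -- chunk already ends on a boundary
  end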
there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines
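For concreteness, a minimal sketch of the hook Hans means, assuming LuaTeX's process_input_buffer callback (file reading itself can likewise be overloaded with the open_read_file callback); the body here is just a placeholder:

  -- process_input_buffer sees each input line before TeX does, so
  -- lines can be preprocessed without touching the engine's reader
  callback.register("process_input_buffer", function(line)
    -- transcode or normalize the line here; the returned string is
    -- what TeX actually reads
    return line
  end)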
Hans, "no need to change" is a fine song. But I am trying my best to work on making TeX/LuaTeX become more robust for large scale deployment because that is where the interest of my employer lies. I don't think that it conflicts with your goals. If I get my employer to agree, I'll also try investing time cleaning up after the Aleph/Omega mess, by analyzing the code, its effects on existing documents and TeX's efficiency and output, and discussing what makes sense to keep, what to reimplement, what to fix. I realize that the code is there to stay, but there is actually nobody available who is familiar with it, and its quality is sub-par because John Plaice intended all of the quick hacks it consists of to be replaced with C++ code in Omega 2 and so invested no time in cleaning up or sensible documentation. Giuseppe did fix some things, but also did not really delve much into the code, and is no longer available. Of course, one fixed goal is not to remove or break anything that Idris needs. That is understood. Anyway, Lua is not an efficient tool for manipulating megabyte size strings. Strings get allocated in its string pool and stay at _fixed_ addresses until they are freed. There is no compacting garbage collection. For that reason, Lua programmers work with string ropes, which are quite less pleasant for manipulation. The string manipulations that Lua _does_ offer are not convenient or efficient for input encoding transformations which work at a very fine granularity, partly context-dependent. Lua may be better than TeX itself, but anything is better than TeX. I am not asking you to invest any time on this. I do the stuff on my own. But I would ask for a fair evaluation once I have a working solution. And I fail to see why this evaluation should be harmed if I consider not just the needs of 8bit PDFTeX, but rather the specific problems around LuaTeX's utf8. As a note aside: even systems that use utf8 as internal encoding usually require an input and output translation (Emacs 23, for example) in order to convert possibly illegal utf8 sequences into a canonical reproducible legal utf-8 presentation and back again. slnunicode is not prepared to do such verification/normalization, and we would not want to have stability impacted in cases where illicit utf-8 gets input. Having an encoding frontend that assures us nothing but well-formed utf-8 will ever enter LuaTeX (even when the input is random garbage) is an advantage in my book. Again, I am not asking you to invest any work on this. And it is nothing that I'll be submitting as part of my buffer size patch. It is merely something that I choose to keep in mind. -- David Kastrup