Hans Hagen
David Kastrup wrote:
Since LuaTeX has its own complications to take care of with regard to utf8, I would actually prefer to prepare this as a LuaTeX patch. Backporting to PDFTeX should be straightforward.
hm, i was thinking of some circular buffer between reading from file and the normal tex reader (i.e. tex gets bytes and already handles the utf8 part), so it should work for pdftex too
\endlinechar can take up a variable amount of space, and one would not want the buffer border to fall in the middle of a utf-8 _character_: that would necessitate allowing an "end of buffer" condition in far too many places. Also, the TeX error context (which is currently something like 40 _bytes_, not characters) should include a sufficient number of characters. Doing this for LuaTeX and porting it to PDFTeX later should work fine. Doing it the other way round might get more awkward than thinking about utf-8 right from the start.
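A minimal sketch of that boundary rule, written in Lua only for illustration (the actual buffer code would of course live at the C level of the engine, and the function name here is made up): given a freshly read chunk, cut it back so it ends on a complete utf-8 character, and carry the remaining bytes over to the next refill.

  -- Return the largest n such that chunk:sub(1, n) ends on a utf-8
  -- character boundary, so a buffer refill never splits a multi-byte
  -- character across the border. Bytes past n go to the next refill.
  local function utf8_safe_length(chunk)
    local n = #chunk
    local i = n
    -- walk back over continuation bytes (10xxxxxx, i.e. 0x80..0xBF)
    while i > 0 do
      local b = chunk:byte(i)
      if b < 0x80 or b >= 0xC0 then break end
      i = i - 1
    end
    if i == 0 then return 0 end        -- chunk is all continuation bytes
    local lead = chunk:byte(i)
    local need = 1                     -- length the lead byte announces
    if     lead >= 0xF0 then need = 4
    elseif lead >= 0xE0 then need = 3
    elseif lead >= 0xC0 then need = 2 end
    if n - i + 1 < need then
      return i - 1                     -- last character incomplete: cut before it
    end
    return n                           -- chunk already ends on a boundary
  end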
there is no need to change the current luatex input handling since one can overload reading from file as well as preprocess input lines
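For concreteness, a minimal sketch of the hook Hans means, assuming LuaTeX's process_input_buffer callback (file reading itself can likewise be overloaded with the open_read_file callback); the body here is just a placeholder:

  -- process_input_buffer sees each input line before TeX does, so
  -- lines can be preprocessed without touching the engine's reader
  callback.register("process_input_buffer", function(line)
    -- transcode or normalize the line here; the returned string is
    -- what TeX actually reads
    return line
  end)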
Hans, "no need to change" is a fine song. But I am trying my best to work on making TeX/LuaTeX become more robust for large scale deployment because that is where the interest of my employer lies. I don't think that it conflicts with your goals. If I get my employer to agree, I'll also try investing time cleaning up after the Aleph/Omega mess, by analyzing the code, its effects on existing documents and TeX's efficiency and output, and discussing what makes sense to keep, what to reimplement, what to fix. I realize that the code is there to stay, but there is actually nobody available who is familiar with it, and its quality is sub-par because John Plaice intended all of the quick hacks it consists of to be replaced with C++ code in Omega 2 and so invested no time in cleaning up or sensible documentation. Giuseppe did fix some things, but also did not really delve much into the code, and is no longer available. Of course, one fixed goal is not to remove or break anything that Idris needs. That is understood. Anyway, Lua is not an efficient tool for manipulating megabyte size strings. Strings get allocated in its string pool and stay at _fixed_ addresses until they are freed. There is no compacting garbage collection. For that reason, Lua programmers work with string ropes, which are quite less pleasant for manipulation. The string manipulations that Lua _does_ offer are not convenient or efficient for input encoding transformations which work at a very fine granularity, partly context-dependent. Lua may be better than TeX itself, but anything is better than TeX. I am not asking you to invest any time on this. I do the stuff on my own. But I would ask for a fair evaluation once I have a working solution. And I fail to see why this evaluation should be harmed if I consider not just the needs of 8bit PDFTeX, but rather the specific problems around LuaTeX's utf8. As a note aside: even systems that use utf8 as internal encoding usually require an input and output translation (Emacs 23, for example) in order to convert possibly illegal utf8 sequences into a canonical reproducible legal utf-8 presentation and back again. slnunicode is not prepared to do such verification/normalization, and we would not want to have stability impacted in cases where illicit utf-8 gets input. Having an encoding frontend that assures us nothing but well-formed utf-8 will ever enter LuaTeX (even when the input is random garbage) is an advantage in my book. Again, I am not asking you to invest any work on this. And it is nothing that I'll be submitting as part of my buffer size patch. It is merely something that I choose to keep in mind. -- David Kastrup