Hi,

I just tried doing

  luatex -ini latex.ltx

with a freshly checked out LuaTeX. The result is

This is luaTeX, Version 3.141592-snapshot-2007032611 (Web2C 7.5.6) (INITEX)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/latex.ltx
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/texsys.cfg)
./texsys.aux found
\@currdir set to: ./.
Assuming \openin and \input have the same search path.
Defining UNIX/DOS style filename parser.
catcodes, registers, compatibility for TeX 2, parameters, LaTeX2e <2005/12/01>
hacks, control, par, spacing, files, font encodings, lengths,
====================================
Local config file fonttext.cfg used
====================================
(/usr/local/texlive/2007/texmf-dist/tex/cslatex/base/fonttext.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omlenc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/t1enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/cslatex/il2enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omsenc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/t1cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/cslatex/il2cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmss.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmtt.fd))
====================================
Local config file fontmath.cfg used
====================================
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/fontmath.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/fontmath.ltx
=== Don't modify this file, use a .cfg file instead ===
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omlcmm.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omscmsy.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omxcmex.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ucmr.fd)))
====================================
Local config file preload.cfg used
====================================
(/usr/local/texlive/2007/texmf/tex/generic/config/preload.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/preload.ltx))
page nos., x-ref, environments, center, verbatim, math definitions, boxes,
title, sectioning, contents, floats, footnotes, index, bibliography, output,
! Buffer contains an invalid utf-8 sequence.
l.7804 \lccode`\�=`\i % dotted I
?
! Pool contains an invalid utf-8 sequence.
l.7804 \lccode`\�=`\i % dotted I
?
! Buffer contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\^^9d % dotted I
?
! Pool contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\^^9d % dotted I
?
! Buffer contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\� % dotted I
?
! Pool contains an invalid utf-8 sequence
[...]

Now the sequences in question are:

\ifnum\inputlineno=\m@ne\else
\lccode`\^^9d=`\i    % dotted I
\uccode`\^^9d=`\^^9d % dotted I
\lccode`\^^9e=`\^^9e % d-bar
\uccode`\^^9e=`\^^d0 % d-bar
\fi

In short: the buffer does not contain any illegal utf-8 sequence at all! latex.ltx consists _solely_ of ASCII characters in the range 0-127. Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).

(/usr/local/texlive/2007/texmf-dist/tex/generic/xu-hyphen/xu-bahyph.tex
! Text line contains an invalid utf-8 sequence.
l.17 \lccode`\�=0
?
! Text line contains an invalid utf-8 sequence.
l.20 \ifnum\lccode`\�=0 % if bahyph.tex didn't change this,
?
Again, the input file is purely ASCII, in this case

\begingroup
\expandafter\ifx\csname XeTeXrevision\endcsname\relax
\else
  % The standard bahyph.tex is plain ASCII, so directly readable;
  % but we want to add patterns for n-tilde (^^f1), as generated by
  % bahyph.sh if the "latin1" option is given.
  % However, if a "latin1" version of bahyph was already present,
  % these would be duplicate patterns.
  % We'll watch the \lccode of ^^f1 so as to detect this.
  \lccode`\^^f1=0
  \let\PATTERNS=\patterns
  \def\patterns{%
    \ifnum\lccode`\^^f1=0  % if bahyph.tex didn't change this,
      \lccode`\^^f1=`\^^f1 % then we can load the extra patterns here
      \PATTERNS{1^^f1a 1^^f1e 1^^f1o 1^^f1i 1^^f1u}%
    \fi
    \PATTERNS
  }
\fi

So we have error messages about "pool", "buffer" and "text line" containing invalid utf-8 sequences, when the input actually is just ASCII.

-- 
David Kastrup
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
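To make that point concrete, here is a one-line Lua illustration (a sketch; it assumes LuaTeX's slnunicode-based unicode.utf8 library, which is where such a conversion lived at the time). The two-character notation ^^9d names the code point U+009D, whose UTF-8 form is the two bytes C2 9D, not the single raw byte 9D:

  -- U+009D encoded as utf-8: two bytes, 0xC2 0x9D
  print(unicode.utf8.char(0x9d))   -- "\194\157"
  -- the single raw byte 0x9D is not a valid utf-8 sequence by itself
  print(string.char(0x9d))         -- "\157"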
Arthur Reutenauer wrote:
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
It worked before, so I probably messed up something along the way. It is safe to assume there will be a fix in the next snapshot.

Taco
Taco Hoekwater
Arthur Reutenauer wrote:
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
It worked before, so I probably messed up something along the way. It is safe to assume there will be a fix in the next snapshot.
Anyway: I think it is a safe assumption that LuaTeX should be able to deal with current versions of LaTeX (I think it would be a mistake to have to rely on lambda). So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional. I don't have any clue about the current implementation, but the number of error messages I got suggests there are several areas involved. Here is my take on what would constitute a sane environment (some of which is probably already implemented in XeTeX), in my opinion:

Single characters: encoded in Unicode (UCS-21 or similar).

Input line buffer: an array of single characters. Characters are created from input using the input coding system of the file (basically one of 8-bit or utf-8; at some later point of time possibly also things like utf-16-le or utf-16-be). LaTeX would be fixed to "transparent" at first, which would make it work as before. However, one would eventually want to add something like an utf8l input encoding in order to have it behave more sanely.

String space: utf-8 encoded. This is probably incompatible with previous code, but saves space.

Log and console output: switchable between utf-8 and 8-bit, probably depending on locale and/or inherited from the mode of the current input file. In "8-bit" mode, obviously all characters with a code point above 255 need to be output as ^^^^abcd or ^^^^^^01abcd or similar (see the sketch after this message).

Write streams: similar. It might be possible to generally write utf-8, but then it might be a good idea to add a byte order mark at the start of files so that \input on such files will flip the coding system appropriately.

I really need to take a look at XeTeX.

-- 
David Kastrup
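As a concrete illustration of the "8-bit" console fallback proposed above, here is a minimal Lua sketch. It is hedged throughout: the function name escape_for_8bit is made up, and the utf8.codes iterator is the one from Lua 5.3+, which the LuaTeX of that era did not have; treat it as pseudocode for the proposed behaviour, not an existing interface.

  -- Hypothetical sketch: code points above 255 are escaped as
  -- ^^^^abcd (or ^^^^^^01abcd beyond the BMP); bytes pass through.
  local function escape_for_8bit(s)
    local out = {}
    for _, cp in utf8.codes(s) do       -- Lua 5.3+ iterator (assumption)
      if cp < 256 then
        out[#out + 1] = string.char(cp)
      elseif cp < 0x10000 then
        out[#out + 1] = string.format("^^^^%04x", cp)
      else
        out[#out + 1] = string.format("^^^^^^%06x", cp)
      end
    end
    return table.concat(out)
  end

  print(escape_for_8bit("a \u{2264} b"))  -- "a ^^^^2264 b"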
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded. If you want to do bare bytes, you have to preprocess them in lua.

Taco
Taco Hoekwater
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
If you want to do bare bytes, you have to preprocess them in lua.
How do you interpret input bytes that don't form valid utf-8 sequences? As long as they are preserved in some recognizable manner, it should be possible to do this sort of reverse conversion to the original bytes, but it certainly does not sound like it would make for attractive speed.

-- 
David Kastrup
David Kastrup wrote:
Taco Hoekwater writes:
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
The bare engine is the compiled executable code. Filtering and reencoding can be done using lua scripts, and those are interpreted (i.e. runtime). This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.

If you believe it is possible to support arbitrary 8-bit encodings while supporting utf-8 properly at the same time, feel free to donate the pascal web/C code to do so. I am not willing to spend time on that myself, considering we have a scripting language built in that is ideally suited to taking care of this problem.

Supporting utf-8 properly means: no need to have active \catcode-s for >128, but allow utf-8 sequences to be treated as a single character everywhere (for example in messaging, to be used inside \csnames, and as argument to \catcode c.s.), and also remove the need for port-dependent things like tcx files and -8bit.

Best wishes, Taco
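The kind of lua preprocessing Taco refers to might look like the following sketch, which treats each incoming line as latin-1 and re-encodes it as utf-8 before the engine sees it. It assumes the process_input_buffer callback and the unicode.utf8 library described in the LuaTeX manual of the time; the details are illustrative, not a definitive implementation.

  -- Sketch: reinterpret raw bytes >= 0x80 as latin-1 code points and
  -- hand the engine their utf-8 encoding instead.
  callback.register("process_input_buffer", function(line)
    return (line:gsub("[\128-\255]", function(byte)
      return unicode.utf8.char(string.byte(byte))
    end))
  end)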
Taco Hoekwater
[...] This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.
Dangerous advice since this gives me ideas...

Here is something I find worth giving a different API:

\subsubsection{\callback{token_filter}}

This callback allows you to intercept the fetching of, and preprocess, any lexical token that enters \LUATEX, before \LUATEX\ executes or expands the associated command.

\startfunctioncall
function ()
  return table <token>
end
\stopfunctioncall

The calling convention for this callback is a bit more complicated than for most other callbacks. The function should either return a lua table representing a valid to-be-processed token or tokenlist, or something else, like nil or an empty table.

If your lua function does not return a table representing a valid token, it will be immediately called again, until it eventually does return a useful token or tokenlist (or until you reset the callback value to nil). See the description of \callbacklib{token} for some handy functions to be used in conjunction with this callback.

If your function returns a single usable token, then that token will be processed by \LUATEX\ immediately. If the function returns a token list (a table consisting of a list of consecutive token tables), then that list will be pushed onto the input stack as a completely new token list level, with its token type set to `inserted'. In either case, the returned token(s) will not be fed back into the callback function.

I think that I would like to propose a much more luatic solution:

If token_filter is set, it is called with one argument \verb|get_next|, the function originally supposed to get the next token. token_filter should then call this function as often as it needs to (possibly zero times) and return one token to the caller.

If you need to read ahead and buffer tokens (like when simulating OTPs), the easiest way to do this is to use something like the following for the filter function:

coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)

OK, the code itself is nonsensical, but it should illustrate the working principle: if the filtering is not 1:1, one can use a coroutine for analysing the input, buffering, and producing the tokens. This approach also has the advantage that one can stack filter functions easily.

The existing interface makes that much harder: I actually have no good idea how one would go about it.

One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.

-- 
David Kastrup
David@lola.quinscape.zz wrote:
Taco Hoekwater writes:
[...] This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.
Dangerous advice since this gives me ideas...
Here is something I find worth giving a different API:
\subsubsection{\callback{token_filter}}
This callback allows you to intercept the fetching of, and preprocess, any lexical token that enters \LUATEX, before \LUATEX\ executes or expands the associated command.
\startfunctioncall
function ()
  return table <token>
end
\stopfunctioncall
The calling convention for this callback is a bit more complicated than for most other callbacks. The function should either return a lua table representing a valid to-be-processed token or tokenlist, or something else, like nil or an empty table.
If your lua function does not return a table representing a valid token, it will be immediately called again, until it eventually does return a useful token or tokenlist (or until you reset the callback value to nil). See the description of \callbacklib{token} for some handy functions to be used in conjunction with this callback.
If your function returns a single usable token, then that token will be processed by \LUATEX\ immediately. If the function returns a token list (a table consisting of a list of consecutive token tables), then that list will be pushed onto the input stack as a completely new token list level, with its token type set to `inserted'. In either case, the returned token(s) will not be fed back into the callback function.
I think that I would like to propose a much more luatic solution:
If token_filter is set, it is called with one argument \verb|get_next|, the function originally supposed to get the next token.
token_filter should then call this function as often as it needs to (possibly zero times) and return one token to the caller.
If you need to read ahead and buffer tokens (like when simulating OTPs), the easiest way to do this is to use something like the following for the filter function:
coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)
OK, the code itself is nonsensical, but it should illustrate the working principle: if the filtering is not 1:1, one can use a coroutine for analysing the input, buffering, and producing the tokens. This approach also has the advantage that one can stack filter functions easily.
The existing interface makes that much harder: I actually have no good idea how one would go about it.
One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.
The problem is that this is real slow, which renders it rather unusable; even the current implementation is already on the edge of acceptable.

Why do you want to handle the ^'s? You can do that using the input line callback.

Hans

-- 
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
Hans Hagen
The problem is that this is real slow, which renders it rather unusable; even the current implementation is already on the edge of acceptable.
why do you want to handle the ^'s?
you can do that using the input line callback
That was just a stupid example of some text transformation, intended to illustrate how one could do this sort of thing with the proposed token_filter semantics. I certainly would not want to do anything like that for actually handling "^^". It was just an example of a less than trivial task solved using token_filter: I find the current semantics of that hook quite contorted.

Can you think of a particular task, implemented using the current token_filter semantics, that would become noticeably slower with the simpler semantics I proposed, namely calling token_filter when a token is expected and giving it as an argument a routine to call for fetching a token to transform? I don't see a task that can be implemented better (faster or easier) with the current semantics than with the proposed simplified semantics. Could you give an example?

-- 
David Kastrup
Hans Hagen wrote:
If you need to readahead and buffer tokens (like when simulating OTPs), the easiest way to do this is using something like the following for the filter function:
If you need to read ahead for tokens, just run token.get_next() in a loop that stores tokens in a local table until you are happy. Then return that table after processing it. There is no need to return to the TeX control loop before that.

Best, Taco
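In other words, a hedged sketch of what Taco describes, using the token_filter callback and the token.get_next() he names (the fixed three-token lookahead is a made-up placeholder for whatever pattern you are matching):

  -- Sketch: buffer lookahead inside one token_filter invocation and
  -- return the whole (possibly rewritten) list at once.
  callback.register("token_filter", function()
    local buf = {}
    for i = 1, 3 do                 -- hypothetical: three tokens of lookahead
      buf[#buf + 1] = token.get_next()
    end
    -- ... inspect and rewrite buf here ...
    return buf                      -- pushed back as one `inserted' token list
  end)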
I wrote:
coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)
One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.
About that lookahead and phasing the filter routine out: more consistent would probably be the following: when there is no longer input for the filter routine, pass it nil as the input routine. Once it bleeds out nil, it is finished. So we get:

coroutine.wrap(function(get_token)
  while get_token ~= nil do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
  return nil
end)

Managing multiple filter functions will still be some work, probably requiring the use of a suitable helper function. But basically, I find this sort of interface more natural than the current token_filter semantics.

-- 
David Kastrup
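One shape the "suitable helper function" mentioned above could take, under the *proposed* (not the current) semantics where a filter is a function receiving a get_token routine and returning one token; chain is a hypothetical name and this is only a sketch of the idea:

  -- Hypothetical helper: compose two filters so that the outer one
  -- draws its tokens from the inner one.
  local function chain(outer, inner)
    return function(get_token)
      return outer(function() return inner(get_token) end)
    end
  end

  -- usage sketch: token_filter = chain(filter_a, filter_b)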
David Kastrup wrote:
Taco Hoekwater writes:
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
If you want to do bare bytes, you have to preprocess them in lua.
How do you interpret input bytes that don't form valid utf-8 sequences? As long as they are preserved in some recognizable manner, it should be possible to do this sort of reverse conversion to the original bytes, but it certainly does not sound like it would make for attractive speed.
You can define a callback that will intercept each line and do whatever you want with the content, as long as what you pipe back into tex is utf-8.

The internal dataflow is utf-8, and as the manual states, getting non-utf (8-bit) out is a matter of remapping to a reserved private area in unicode (for instance, pdf literals may need 8-bit instead of utf, and that's how it's done). This keeps luatex internally clean, but permits macro writers to do what they want; it's also the principle of luatex: provide access and points of interception, but stay as clean as possible internally.

Anyhow, good old tex was never 8-bit clean (at least not till recently, and then only with natural.tcx or -8bit). Also keep in mind that macro packages need to adapt to luatex and not the reverse -)

Hans

-- 
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
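A hedged sketch of the private-area remapping Hans mentions: raw bytes survive the utf-8 pipeline by being mapped to private-use code points on input and back to single bytes on output. The callback name and unicode.utf8 helpers are those of the LuaTeX manual of the time; the offset 0xF0000 is an assumption of this example, not necessarily the area LuaTeX actually reserves.

  local OFFSET = 0xF0000  -- assumed private area; illustrative only

  -- on input: smuggle each raw high byte through as a private code point
  callback.register("process_input_buffer", function(line)
    return (line:gsub("[\128-\255]", function(b)
      return unicode.utf8.char(OFFSET + string.byte(b))
    end))
  end)

  -- on output (e.g. before emitting a pdf literal): map them back
  local function to_bytes(s)
    return (unicode.utf8.gsub(s, ".", function(c)
      local cp = unicode.utf8.byte(c)
      if cp >= OFFSET and cp < OFFSET + 256 then
        return string.char(cp - OFFSET)
      end
    end))
  end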
participants (6)
- Arthur Reutenauer
- David Kastrup
- David Kastrup
- David@lola.quinscape.zz
- Hans Hagen
- Taco Hoekwater