Hi,

I just tried doing

  luatex -ini latex.ltx

with a freshly checked out LuaTeX. The result is

This is luaTeX, Version 3.141592-snapshot-2007032611 (Web2C 7.5.6) (INITEX)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/latex.ltx
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/texsys.cfg)
./texsys.aux found
\@currdir set to: ./.
Assuming \openin and \input have the same search path.
Defining UNIX/DOS style filename parser.
catcodes, registers, compatibility for TeX 2, parameters, LaTeX2e <2005/12/01>
hacks, control, par, spacing, files, font encodings, lengths,
====================================
Local config file fonttext.cfg used
====================================
(/usr/local/texlive/2007/texmf-dist/tex/cslatex/base/fonttext.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omlenc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/t1enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/cslatex/il2enc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omsenc.def)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/t1cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/cslatex/il2cmr.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmss.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ot1cmtt.fd))
====================================
Local config file fontmath.cfg used
====================================
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/fontmath.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/fontmath.ltx
=== Don't modify this file, use a .cfg file instead ===
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omlcmm.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omscmsy.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/omxcmex.fd)
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/ucmr.fd)))
====================================
Local config file preload.cfg used
====================================
(/usr/local/texlive/2007/texmf/tex/generic/config/preload.cfg
(/usr/local/texlive/2007/texmf-dist/tex/latex/base/preload.ltx))
page nos., x-ref, environments, center, verbatim, math definitions, boxes,
title, sectioning, contents, floats, footnotes, index, bibliography, output,
! Buffer contains an invalid utf-8 sequence.
l.7804 \lccode`\�=`\i % dotted I
?
! Pool contains an invalid utf-8 sequence.
l.7804 \lccode`\�=`\i % dotted I
?
! Buffer contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\^^9d % dotted I
?
! Pool contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\^^9d % dotted I
?
! Buffer contains an invalid utf-8 sequence.
l.7805 \uccode`\�=`\� % dotted I
?
! Pool contains an invalid utf-8 sequence
[...]

Now the sequences in question are:

\ifnum\inputlineno=\m@ne\else
\lccode`\^^9d=`\i    % dotted I
\uccode`\^^9d=`\^^9d % dotted I
\lccode`\^^9e=`\^^9e % d-bar
\uccode`\^^9e=`\^^d0 % d-bar
\fi

In short: the buffer does not contain any illegal utf-8 sequence at all! latex.ltx consists _solely_ of ASCII characters in the range 0-127. Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).

(/usr/local/texlive/2007/texmf-dist/tex/generic/xu-hyphen/xu-bahyph.tex
! Text line contains an invalid utf-8 sequence.
l.17 \lccode`\�=0
?
! Text line contains an invalid utf-8 sequence.
l.20 \ifnum\lccode`\�=0 % if bahyph.tex didn't change this,
?
Again, the input file is purely ASCII, in this case

\begingroup
\expandafter\ifx\csname XeTeXrevision\endcsname\relax
\else
  % The standard bahyph.tex is plain ASCII, so directly readable;
  % but we want to add patterns for n-tilde (^^f1), as generated by
  % bahyph.sh if the "latin1" option is given.
  % However, if a "latin1" version of bahyph was already present,
  % these would be duplicate patterns.
  % We'll watch the \lccode of ^^f1 so as to detect this.
  \lccode`\^^f1=0
  \let\PATTERNS=\patterns
  \def\patterns{%
    \ifnum\lccode`\^^f1=0  % if bahyph.tex didn't change this,
      \lccode`\^^f1=`\^^f1 % then we can load the extra patterns here
      \PATTERNS{1^^f1a 1^^f1e 1^^f1o 1^^f1i 1^^f1u}%
    \fi
    \PATTERNS
  }
\fi

So we have error messages about "pool", "buffer" and "text line" containing invalid utf-8 sequences, when the input actually is just ASCII.

-- 
David Kastrup
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
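To make that point concrete, here is a one-line Lua illustration (a sketch; it assumes LuaTeX's slnunicode-based unicode.utf8 library, which is where such a conversion lived at the time). The two-character notation ^^9d names the code point U+009D, whose UTF-8 form is the two bytes C2 9D, not the single raw byte 9D:

  -- U+009D encoded as utf-8: two bytes, 0xC2 0x9D
  print(unicode.utf8.char(0x9d))   -- "\194\157"
  -- the single raw byte 0x9D is not a valid utf-8 sequence by itself
  print(string.char(0x9d))         -- "\157"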
Arthur Reutenauer wrote:
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
It worked before, so I probably messed up something along the way. It is safe to assume there will be a fix in the next snapshot.

Taco
Taco Hoekwater
Arthur Reutenauer wrote:
Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_ of characters which happen to be legal _characters_ in Unicode (though not legal _bytes_ in utf-8).
Good spot; I had already noticed there were many problems with LaTeX, but I thought they were mainly due to pattern files (and I gave up very early on LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8 encoded sequence and not an individual byte (XeTeX indeed is perfectly happy with it).
It worked before, so I probably messed up something along the way. It is safe to assume there will be a fix in the next snapshot.
Anyway: I think it is a safe assumption that LuaTeX should be able to deal with current versions of LaTeX (I think it would be a mistake to have to rely on lambda). So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional. I don't have any clue about the current implementation, but the number of error messages I got suggests there are several areas involved. Here is my take on what would constitute a sane environment (some of which is probably already implemented in XeTeX), in my opinion:

Single characters: encoded in Unicode (UCS-21 or similar).

Input line buffer: an array of single characters. Characters are created from input using the input coding system of the file (basically one of 8-bit or utf-8; at some later point of time possibly also things like utf-16-le or utf-16-be). LaTeX would be fixed to "transparent" at first, which would make it work as before. However, one would eventually want to add something like an utf8l input encoding in order to have it behave more sanely.

String space: utf-8 encoded. This is probably incompatible with previous code, but saves space.

Log and console output: switchable between utf-8 and 8-bit, probably depending on locale and/or inherited from the mode of the current input file. In "8-bit" mode, obviously all characters with a code point above 255 need to be output as ^^^^abcd or ^^^^^^01abcd or similar (see the sketch after this message).

Write streams: similar. It might be possible to generally write utf-8, but then it might be a good idea to add a byte order mark at the start of files so that \input on such files will flip the coding system appropriately.

I really need to take a look at XeTeX.

-- 
David Kastrup
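As a concrete illustration of the "8-bit" console fallback proposed above, here is a minimal Lua sketch. It is hedged throughout: the function name escape_for_8bit is made up, and the utf8.codes iterator is the one from Lua 5.3+, which the LuaTeX of that era did not have; treat it as pseudocode for the proposed behaviour, not an existing interface.

  -- Hypothetical sketch: code points above 255 are escaped as
  -- ^^^^abcd (or ^^^^^^01abcd beyond the BMP); bytes pass through.
  local function escape_for_8bit(s)
    local out = {}
    for _, cp in utf8.codes(s) do       -- Lua 5.3+ iterator (assumption)
      if cp < 256 then
        out[#out + 1] = string.char(cp)
      elseif cp < 0x10000 then
        out[#out + 1] = string.format("^^^^%04x", cp)
      else
        out[#out + 1] = string.format("^^^^^^%06x", cp)
      end
    end
    return table.concat(out)
  end

  print(escape_for_8bit("a \u{2264} b"))  -- "a ^^^^2264 b"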
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded. If you want to do bare bytes, you have to preprocess them in lua.

Taco
Taco Hoekwater
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
If you want to do bare bytes, you have to preprocess them in lua.
How do you interpret input bytes that don't form valid utf-8 sequences? As long as they are preserved in some recognizable manner, it should be possible to do this sort of reverse conversion to the original bytes, but it certainly does not sound like it would make for attractive speed.

-- 
David Kastrup
David Kastrup wrote:
Taco Hoekwater writes:
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
The bare engine is the compiled executable code. Filtering and reencoding can be done using lua scripts, and those are interpreted (i.e. runtime). This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.

If you believe it is possible to support arbitrary 8-bit encodings while supporting utf-8 properly at the same time, feel free to donate the pascal web/C code to do so. I am not willing to spend time on that myself, considering we have a scripting language built in that is ideally suited to taking care of this problem.

Supporting utf-8 properly means: no need to have active \catcode-s for >128, but allow utf-8 sequences to be treated as a single character everywhere (for example in messaging, to be used inside \csnames, and as argument to \catcode c.s.), and also remove the need for port-dependent things like tcx files and -8bit.

Best wishes, Taco
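The kind of lua preprocessing Taco refers to might look like the following sketch, which treats each incoming line as latin-1 and re-encodes it as utf-8 before the engine sees it. It assumes the process_input_buffer callback and the unicode.utf8 library described in the LuaTeX manual of the time; the details are illustrative, not a definitive implementation.

  -- Sketch: reinterpret raw bytes >= 0x80 as latin-1 code points and
  -- hand the engine their utf-8 encoding instead.
  callback.register("process_input_buffer", function(line)
    return (line:gsub("[\128-\255]", function(byte)
      return unicode.utf8.char(string.byte(byte))
    end))
  end)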
Taco Hoekwater
[...] This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.
Dangerous advice since this gives me ideas...

Here is something I find worth giving a different API:

\subsubsection{\callback{token_filter}}

This callback allows you to intercept the fetching of, and preprocess, any lexical token that enters \LUATEX, before \LUATEX\ executes or expands the associated command.

\startfunctioncall
function ()
  return table <token>
end
\stopfunctioncall

The calling convention for this callback is a bit more complicated than for most other callbacks. The function should either return a lua table representing a valid to-be-processed token or tokenlist, or something else, like nil or an empty table.

If your lua function does not return a table representing a valid token, it will be immediately called again, until it eventually does return a useful token or tokenlist (or until you reset the callback value to nil). See the description of \callbacklib{token} for some handy functions to be used in conjunction with this callback.

If your function returns a single usable token, then that token will be processed by \LUATEX\ immediately. If the function returns a token list (a table consisting of a list of consecutive token tables), then that list will be pushed onto the input stack as a completely new token list level, with its token type set to `inserted'. In either case, the returned token(s) will not be fed back into the callback function.

I think that I would like to propose a much more luatic solution:

If token_filter is set, it is called with one argument \verb|get_next|, the function originally supposed to get the next token. token_filter should then call this function as often as it needs to (possibly zero times) and return one token to the caller.

If you need to read ahead and buffer tokens (like when simulating OTPs), the easiest way to do this is to use something like the following for the filter function:

coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)

OK, the code itself is nonsensical, but it should illustrate the working principle: if the filtering is not 1:1, one can use a coroutine for analysing the input, buffering, and producing the tokens. This approach also has the advantage that one can stack filter functions easily.

The existing interface makes that much harder: I actually have no good idea how one would go about it.

One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.

-- 
David Kastrup
David@lola.quinscape.zz wrote:
Taco Hoekwater writes:
[...] This is discussed in the reference manual, so if you have not looked at that yet, please do so before replying to this message.
Dangerous advice since this gives me ideas...
Here is something I find worth giving a different API:
\subsubsection{\callback{token_filter}}
This callback allows you to intercept the fetching of, and preprocess, any lexical token that enters \LUATEX, before \LUATEX\ executes or expands the associated command.
\startfunctioncall
function ()
  return table <token>
end
\stopfunctioncall
The calling convention for this callback is a bit more complicated than for most other callbacks. The function should either return a lua table representing a valid to-be-processed token or tokenlist, or something else, like nil or an empty table.
If your lua function does not return a table representing a valid token, it will be immediately called again, until it eventually does return a useful token or tokenlist (or until you reset the callback value to nil). See the description of \callbacklib{token} for some handy functions to be used in conjunction with this callback.
If your function returns a single usable token, then that token will be processed by \LUATEX\ immediately. If the function returns a token list (a table consisting of a list of consecutive token tables), then that list will be pushed onto the input stack as a completely new token list level, with its token type set to `inserted'. In either case, the returned token(s) will not be fed back into the callback function.
I think that I would like to propose a much more luatic solution:
If token_filter is set, it is called with one argument \verb|get_next|, the function originally supposed to get the next token.
token_filter should then call this function as often as it needs to (possibly zero times) and return one token to the caller.
If you need to read ahead and buffer tokens (like when simulating OTPs), the easiest way to do this is to use something like the following for the filter function:
coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)
OK, the code itself is nonsensical, but it should illustrate the working principle: if the filtering is not 1:1, one can use a coroutine for analysing the input, buffering, and producing the tokens. This approach also has the advantage that one can stack filter functions easily.
The existing interface makes that much harder: I actually have no good idea how one would go about it.
One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.
The problem is that this is real slow, which renders it rather unusable; even the current implementation is already on the edge of acceptable.

Why do you want to handle the ^'s? You can do that using the input line callback.

Hans

-- 
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
Hans Hagen
The problem is that this is real slow, which renders it rather unusable; even the current implementation is already on the edge of acceptable.
why do you want to handle the ^'s?
you can do that using the input line callback
That was just a stupid example of some text transformation, intended to illustrate how one could do this sort of thing with the proposed token_filter semantics. I certainly would not want to do anything like that for actually handling "^^". It was just an example of a less than trivial task solved using token_filter: I find the current semantics of that hook quite contorted.

Can you think of a particular task, implemented using the current token_filter semantics, that would become noticeably slower with the simpler semantics I proposed, namely calling token_filter when a token is expected and giving it as an argument a routine to call for fetching a token to transform? I don't see a task that can be implemented better (faster or easier) with the current semantics than with the proposed simplified semantics. Could you give an example?

-- 
David Kastrup
Hans Hagen wrote:
If you need to readahead and buffer tokens (like when simulating OTPs), the easiest way to do this is using something like the following for the filter function:
If you need to read ahead for tokens, just run token.get_next() in a loop that stores tokens in a local table until you are happy. Then return that table after processing it. There is no need to return to the TeX control loop before that.

Best, Taco
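In other words, a hedged sketch of what Taco describes, using the token_filter callback and the token.get_next() he names (the fixed three-token lookahead is a made-up placeholder for whatever pattern you are matching):

  -- Sketch: buffer lookahead inside one token_filter invocation and
  -- return the whole (possibly rewritten) list at once.
  callback.register("token_filter", function()
    local buf = {}
    for i = 1, 3 do                 -- hypothetical: three tokens of lookahead
      buf[#buf + 1] = token.get_next()
    end
    -- ... inspect and rewrite buf here ...
    return buf                      -- pushed back as one `inserted' token list
  end)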
I wrote:
coroutine.wrap(function(get_token)
  while true do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
end)
One problem with this approach is that the lookahead kept internally within a coroutine will get lost when one switches the filter function out (not that the current approach fares better here). One solution might be to pass an artificial "EOF" token to the filter function as the last act before removing it from token_filter, and accepting a list of lookahead tokens as the return value.
About that lookahead and phasing the filter routine out: more consistent would probably be the following: when there is no longer input for the filter routine, pass it nil as the input routine. Once it bleeds out nil, it is finished. So we get:

coroutine.wrap(function(get_token)
  while get_token ~= nil do
    local token1 = get_token()
    if token1.cmd ~= "^" then
      get_token = coroutine.yield(token1)
    else
      local token2 = get_token()
      if token2.cmd ~= "^" then
        coroutine.yield(token1)
        get_token = coroutine.yield(token2)
      else
        local token3 = get_token()
        if token3.cmd ... then
          get_token = coroutine.yield(something)
        else
          coroutine.yield(token1)
          coroutine.yield(token2)
          get_token = coroutine.yield(token3)
        end
      end
    end
  end
  return nil
end)

Managing multiple filter functions will still be some work, probably requiring the use of a suitable helper function. But basically, I find this sort of interface more natural than the current token_filter semantics.

-- 
David Kastrup
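One shape the "suitable helper function" mentioned above could take, under the *proposed* (not the current) semantics where a filter is a function receiving a get_token routine and returning one token; chain is a hypothetical name and this is only a sketch of the idea:

  -- Hypothetical helper: compose two filters so that the outer one
  -- draws its tokens from the inner one.
  local function chain(outer, inner)
    return function(get_token)
      return outer(function() return inner(get_token) end)
    end
  end

  -- usage sketch: token_filter = chain(filter_a, filter_b)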
David Kastrup wrote:
Taco Hoekwater writes:
David Kastrup wrote:
So the kind of utf-8 support (OTP or something) used for Omega needs to be somewhat optional.
No, the error is simply a bug. All I/O characters that are visible to the bare engine are, and will be, utf-8 encoded.
What is "the bare engine"? From the TeX side, one sees Unicode characters.
If you want to do bare bytes, you have to preprocess them in lua.
How do you interpret input bytes that don't form valid utf-8 sequences? As long as they are preserved in some recognizable manner, it should be possible to do this sort of reverse conversion to the original bytes, but it certainly does not sound like it would make for attractive speed.
You can define a callback that will intercept each line and do whatever you want with the content, as long as what you pipe back into tex is utf-8.

The internal dataflow is utf-8, and as the manual states, getting non-utf (8-bit) out is a matter of remapping to a reserved private area in unicode (for instance, pdf literals may need 8-bit instead of utf, and that's how it's done). This keeps luatex internally clean, but permits macro writers to do what they want; it's also the principle of luatex: provide access and points of interception, but stay as clean as possible internally.

Anyhow, good old tex was never 8-bit clean (at least not till recently, and then only with natural.tcx or -8bit). Also keep in mind that macro packages need to adapt to luatex and not the reverse -)

Hans

-- 
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
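A hedged sketch of the private-area remapping Hans mentions: raw bytes survive the utf-8 pipeline by being mapped to private-use code points on input and back to single bytes on output. The callback name and unicode.utf8 helpers are those of the LuaTeX manual of the time; the offset 0xF0000 is an assumption of this example, not necessarily the area LuaTeX actually reserves.

  local OFFSET = 0xF0000  -- assumed private area; illustrative only

  -- on input: smuggle each raw high byte through as a private code point
  callback.register("process_input_buffer", function(line)
    return (line:gsub("[\128-\255]", function(b)
      return unicode.utf8.char(OFFSET + string.byte(b))
    end))
  end)

  -- on output (e.g. before emitting a pdf literal): map them back
  local function to_bytes(s)
    return (unicode.utf8.gsub(s, ".", function(c)
      local cp = unicode.utf8.byte(c)
      if cp >= OFFSET and cp < OFFSET + 256 then
        return string.char(cp - OFFSET)
      end
    end))
  end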
participants (6)
- Arthur Reutenauer
- David Kastrup
- David Kastrup
- David@lola.quinscape.zz
- Hans Hagen
- Taco Hoekwater