[Dev-luatex] friday afternoon warning

Taco Hoekwater taco at elvenkind.com
Fri Jun 2 17:41:42 CEST 2006

Ok, changes committed.

Lots of work has been going on off-line over the past three days.
The current source compiles ok, and it handles utf-8 encoded
latin-1 rather well.

Input and output to the terminal and text files is all utf-8 now.
Invalid UTF-8 generates an error message (unlike xetex/aleph).
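
As an illustration only (Python, not the luaTeX source), strict UTF-8 checking of the kind described above could be sketched as follows. The function name check_utf8 and the wording of the message are made up for this example.

```python
def check_utf8(raw: bytes) -> str:
    """Decode raw input as UTF-8; report an error instead of guessing."""
    try:
        # bytes.decode is strict by default: any invalid or truncated
        # sequence raises UnicodeDecodeError rather than being passed on.
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        raise ValueError(f"invalid UTF-8 at byte {err.start}") from err
```

A lone latin-1 byte such as 0xE9 in otherwise ASCII input is rejected by this sketch, while properly encoded text passes through unchanged.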

Any -translate-filename or -8bit switches are silently ignored.

I have not checked the PDF generation code like \pdfliteral at
all, so if you have 8-bit stuff in there, it will almost
certainly fail to generate valid PDF and/or crash.

Error messages may look silly, because I am not totally sure
that luatex does not utf-8 encode already utf-8 encoded data.
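
To show what such double encoding looks like in general (a Python sketch, not a trace of what luatex actually does), here the bytes of an already encoded character are mistaken for latin-1 characters and encoded a second time:

```python
text = "é"                     # U+00E9
once = text.encode("utf-8")    # correct encoding: two bytes, C3 A9

# The mistake: read the two bytes as two latin-1 characters and
# encode each of them as UTF-8 again.
twice = once.decode("latin-1").encode("utf-8")

# What a UTF-8 terminal shows for the doubly encoded bytes.
shown = twice.decode("utf-8")
```

The single character turns into four bytes, and a terminal displays two unrelated characters in its place.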

Some terminals may dislike the fact that luaTeX happily writes
ascii zeroes and control characters to the screen; this will
get fixed next week.

The primitives \catcode, \lccode, \uccode, \sfcode and \mathcode
all accept a 21-bit number as their first argument now. The
second argument of \lccode and \uccode can also be 21 bits.
\char also accepts a 21-bit number (of course you can input
these 21-bit numbers using backtick notation followed by a
utf-8 string). I have not done anything with \mathcode and
the other math commands yet, so they are still 8-bit.
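
As a rough illustration of the numbers involved (a Python sketch; the helper name backtick_value is hypothetical and is not part of luaTeX), the value that backtick notation yields for a UTF-8 encoded character is its code point, which always fits in 21 bits:

```python
MAX_CODE = 2**21  # 2097152 distinct values fit in 21 bits

def backtick_value(utf8_bytes: bytes) -> int:
    """Return the code point of a single UTF-8 encoded character."""
    ch = utf8_bytes.decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code = ord(ch)
    # Unicode ends at U+10FFFF, which is below 2**21.
    assert code < MAX_CODE
    return code
```

For example, the two bytes CF 84 (the letter τ) give 964, and the four bytes F0 9D 84 9E give 119070 (U+1D11E).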

You can have utf-8 in control sequences. This runs ok for me:

   \catcode`τ=11 \catcode`ε=11 \catcode`χ=11
   \def\τεχ{\TeX} \τεχ \csname τεχ\endcsname \bye

The bottom part of TeX's string pool should have consisted
of 2.097.152 utf-8 strings representing single characters.
Because of memory concerns, this does not actually happen,
and an offset is added to all string pool access instead.
This required changes to the C files as well, and it may
have introduced bugs I have not found yet, so be careful.
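
The offset idea can be sketched like this. This is Python written for illustration under my own assumptions, not the actual WEB or C code, and the names StringPool, make_string and get are invented here.

```python
CHAR_STRINGS = 2**21  # 2097152 string numbers reserved for single characters

class StringPool:
    """String numbers below CHAR_STRINGS stand for single characters and
    are never stored; real strings are stored at (number - CHAR_STRINGS)."""

    def __init__(self):
        self._strings = []

    def make_string(self, s: str) -> int:
        # Store the string and hand out a number above the reserved range.
        self._strings.append(s)
        return CHAR_STRINGS + len(self._strings) - 1

    def get(self, number: int) -> str:
        if number < CHAR_STRINGS:
            # Implicit single-character string, computed on demand.
            # Note: Python's chr() rejects values above 0x10FFFF,
            # a case this sketch does not handle specially.
            return chr(number)
        return self._strings[number - CHAR_STRINGS]
```

With this layout the first stored string gets number 2097152, while a lookup of 964 returns the letter τ without any memory having been spent on it.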

Hyphenation of the unicode base plane works in principle, but
since you cannot map characters above 255 to font code points
yet, it only really works for latin-1 (texnansi) fonts
or if you use active characters. Characters with code points
above 65535 will generate an error when used in \patterns or
\hyphenation. I do not plan to fix this soon (I doubt that
there are languages in those planes that need hyphenation)
but this 16-bit restriction will be lifted, eventually.

The luaTeX engine sets \lccode and \uccode values for
most of Basic Latin, Latin Supplement and Latin Extended-A
alphabetic characters.  This is a temporary measure only:
\lccode and \uccode values should depend on the current language,
not some arbitrary global array. I will fix this later, but
this was the fastest way of making sure that the standard
ConTeXt cont-en.fmt could be made :-)

\skip0=0pt plus 1fillll is no longer an error (superfluous
'l's are simply typeset). This is actually a side-effect of a
change I made to the scan_keyword() routine, but I am happy
with it and will not go back to 100% compatibility mode
unless Knuth himself tells me to :-)

Finally, there is a new executable that is called 'luatangle'.
This is a literal copy of 'otangle', except that it uses 21
bits for the string offset instead of 16. Ideally, this would
be a commandline option to 'tangle', but I do not want to
start a discussion about that on tex-implementors now. Using
this hack is considerably quicker.

Next week will be spent testing and consolidating these changes.

Have fun,

