Hi,

I want to let you know that I will definitely commit my work later today, because it is Friday. You may want to set aside your current work tree before running svn update. I hope to have something that is at least halfway stable before I commit, but I will not make promises.

Cheers,
Taco

Ok, changes committed. Lots of work has been going on off-line over the past three days.

The current source compiles ok, and it handles UTF-8 encoded latin-1 rather well. Input from and output to the terminal and text files are all UTF-8 now. Invalid UTF-8 generates an error message (unlike xetex/aleph). Any -translate-filename or -8bit switches are silently ignored.

I have not checked the PDF generation code like \pdfliteral at all, so if you have 8-bit stuff in there, it will almost certainly fail to generate valid PDF and/or crash.

Error messages may look silly, because I am not totally sure that luaTeX does not UTF-8 encode data that is already UTF-8 encoded. Some terminals may dislike the fact that luaTeX happily writes ASCII zeroes and control characters to the screen; this will get fixed next week.

The primitives \catcode, \lccode, \uccode, \sfcode and \mathcode all accept a 21-bit number as their first argument now. The second argument of \lccode and \uccode can also be 21 bits. \char also accepts a 21-bit number (of course you can input these 21-bit numbers using backtick notation followed by a UTF-8 string). I have not done anything else with \mathcode and the other math commands yet, so they are still 8-bit.

You can have UTF-8 in control sequences. This runs ok for me:

    \catcode`τ=11 \catcode`ε=11 \catcode`χ=11
    \def\τεχ{\TeX}
    \τεχ \csname τεχ\endcsname
    \bye

The bottom part of TeX's string pool should have consisted of 2,097,152 (2^21) UTF-8 strings representing single characters. Because of memory concerns, this does not actually happen, and an offset is added to all string pool access instead. This required changes to the C files as well, and it may have introduced bugs I have not found yet, so be careful.

Hyphenation of the Unicode base plane works in principle, but since you cannot map characters above 255 to font code points yet, it only really works for latin-1 (texnansi) fonts or if you use active characters.
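
To illustrate the string pool offset mentioned earlier in this message, here is a minimal Python sketch. It is not the actual luaTeX code (that lives in the WEB and C sources); the names StringPool, make_string, lookup and utf8_encode are hypothetical, and it assumes that string numbers below 2^21 stand for single characters whose UTF-8 form is computed on demand, while higher numbers index the stored strings after subtracting the offset.

```python
# Illustrative sketch only: names and layout are assumptions, not luaTeX internals.
STRING_OFFSET = 1 << 21  # 2,097,152 single-character "virtual" strings


def utf8_encode(cp: int) -> bytes:
    """Encode a 21-bit number as UTF-8 (1 to 4 bytes), without range policing."""
    if not 0 <= cp < STRING_OFFSET:
        raise ValueError("code point does not fit in 21 bits")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


class StringPool:
    """String pool whose first 2^21 entries are never actually stored."""

    def __init__(self) -> None:
        self._strings: list[bytes] = []

    def make_string(self, text: str) -> int:
        """Store a real string and return its string number (>= STRING_OFFSET)."""
        self._strings.append(text.encode("utf-8"))
        return STRING_OFFSET + len(self._strings) - 1

    def lookup(self, number: int) -> bytes:
        """Return the bytes for a string number, applying the offset."""
        if number < STRING_OFFSET:
            # Single-character string: synthesize it instead of storing it.
            return utf8_encode(number)
        return self._strings[number - STRING_OFFSET]


pool = StringPool()
tex = pool.make_string("τεχ")   # first real string gets number STRING_OFFSET
tau = pool.lookup(ord("τ"))     # single character, computed on demand
```

The point of the design is that every access goes through one offset subtraction, so about two million tiny strings never have to be allocated; the price is that any code path that forgets the offset will read the wrong string.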
Characters with code points above 65535 will generate an error when used in \patterns or \hyphenation. I do not plan to fix this soon (I doubt that there are languages in those planes that need hyphenation), but this 16-bit restriction will be lifted eventually.

The luaTeX engine sets \lccode and \uccode values for most of the Basic Latin, Latin Supplement and Latin Extended-A alphabetic characters. This is a temporary measure only: \lccode and \uccode values should depend on the current language, not on some arbitrary global array. I will fix this later, but this was the fastest way of making sure that the standard ConTeXt cont-en.fmt could be made :-)

\skip0=0pt plus 1fillll is no longer an error (superfluous 'l's are simply typeset). This is actually a side-effect of a change I made to the scan_keyword() routine, but I am happy with it and will not go back to 100% compatibility mode unless Knuth himself tells me to :-)

Finally, there is a new executable that is called 'luatangle'. This is a literal copy of 'otangle', except that it uses 21 bits for the string offset instead of 16. Ideally, this would be a command-line option to 'tangle', but I do not want to start a discussion about that on tex-implementors now. Using this hack is considerably quicker.

Next week will be spent testing and consolidating these changes.

Have fun,
Taco