[Dev-luatex] Very simple sample?

Taco Hoekwater taco at elvenkind.com
Sat Dec 9 12:43:25 CET 2006


Hi Javier,

Javier Bezos wrote:
> Taco:
> 
>>From the excerpt:
> 
>>From now on, whenever \LUATEX\ has to open a text file, it will call
>>the function \type{file_opener} instead of actually opening the file
>>itself. It stores the returned table in its memory, and it uses the
>>function attached to the \type{reader} label for reading lines.
> 
> 
> If I've understood correctly, this applies to files not
> yet opened, but usually the encoding is stated inside
> the file (ie, the file is already open).

That is not a problem, because *you* are the one opening the
file; it is completely under your control. Assume for a moment,
if you will, that all files begin with a first line that contains a
statement like this:

   % encoding=iso-8859-2

Here is an example of how you could extract that information
from the files, without confusing the rest of the system
(-- is a line comment that you can use in pure .lua files):

   -- input:  a file object
   -- output: a string representing that file's encoding
   function find_file_encoding (f)
     -- read the first line (nil if the file is empty)
     local line = f:read()
     -- reset the file offset  (not really needed in this case)
     f:seek("set",0)
     if line == nil then
       -- empty file, return a default
       return "iso-8859-1"
     end
     -- search for encoding
     -- %w == all alphanumerics,
     -- %- == a literal dash
     local fchar, lchar, match = line:find("encoding=([%w%-]+)")
     if fchar == nil then
       -- no encoding found, return a default
       return "iso-8859-1"
     else
       return match
     end
   end
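
To see what that pattern actually picks up, here is a small sketch that
applies it to plain strings instead of a file. The helper name
'encoding_of' is made up for this illustration, and it uses string.match
(which returns the capture directly) rather than string.find:

```lua
-- Hypothetical helper: same pattern as find_file_encoding, but on a string.
function encoding_of(line)
  -- string.match returns the first capture, or nil when nothing matches
  return line:match("encoding=([%w%-]+)") or "iso-8859-1"
end

print(encoding_of("% encoding=iso-8859-2"))   -- iso-8859-2
print(encoding_of("% encoding=koi8-r  etc"))  -- koi8-r
print(encoding_of("\\starttext"))             -- iso-8859-1 (the default)
```

The capture stops at the first character that is neither alphanumeric
nor a dash, so trailing text on the line does no harm.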

You now have to hook this new function into 'file_opener', like so:

   function file_opener (fname)
     local f = io.open(fname)
     if f == nil then
       return nil
     else
       local encoding = find_file_encoding(f)
       local readline = function ()
         local line = f:read()
         if line == nil  then
           return nil
         else
           return latin_to_utf(line, encoding)
         end
       end
       return { reader = readline }
     end
   end

Now you know the file encoding and can make decisions based
on that information (by changing 'latin_to_utf', see below).
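
If you want to try the mechanism outside LuaTeX, the following sketch
plays the role of the engine in plain Lua. It assumes that the reader is
simply called without arguments until it returns nil, which is how the
reader above is written; 'read_all' is a made-up driver, and
'latin_to_utf' is stubbed out as a pass-through here:

```lua
-- Stand-in for the real converter: pass every line through unchanged.
function latin_to_utf(line, encoding)
  return line
end

-- Reduced file_opener: same shape as above, with a fixed encoding.
function file_opener(fname)
  local f = io.open(fname)
  if f == nil then
    return nil
  end
  return {
    reader = function ()
      local line = f:read()
      if line == nil then
        return nil
      end
      return latin_to_utf(line, "iso-8859-1")
    end
  }
end

-- Hypothetical driver: open a file through file_opener and pull lines
-- from the returned table until the reader signals end of file.
function read_all(fname)
  local handle = file_opener(fname)
  if handle == nil then
    return nil
  end
  local lines = {}
  local line = handle.reader()
  while line ~= nil do
    lines[#lines + 1] = line
    line = handle.reader()
  end
  return lines
end

-- Write a small test file, read it back, and clean up.
local fname = os.tmpname()
local out = assert(io.open(fname, "w"))
out:write("% encoding=iso-8859-2\n\\starttext\n\\stoptext\n")
out:close()
local lines = read_all(fname)
os.remove(fname)
print(#lines, lines[1])
```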

> But where is the input encoding? Apparently this changes
> the "Unicode" representation from 8 bits (thus limited to
> the range 0-255, which is certainly latin-1) to utf-8,
> without reencoding anything (say, iso greek, koi8, macos,
> jis, etc.). I've googled for docs on unicode for lua but
> I haven't found anything particularly useful. 

An 8-bit encoding is nothing more than a mapping of 256 byte
values into unicode code points. In the simplest case, this
is an identity map, and the only difference is in file format
representation (that is what happened in my original example).
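
The identity-map case can be made concrete in plain Lua, without any
LuaTeX library. This is only a sketch: the names 'utf8_from_codepoint'
and 'latin1_to_utf8' are made up for illustration, and the encoder only
handles code points below 0x800, which is enough for latin-1 input:

```lua
-- Encode one code point below 0x800 as a UTF-8 byte sequence.
function utf8_from_codepoint(cp)
  if cp < 0x80 then
    -- ASCII range: one byte, unchanged
    return string.char(cp)
  elseif cp < 0x800 then
    -- two bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  end
  error("code point out of range for this sketch: " .. cp)
end

-- Identity map: every latin-1 byte value is its own code point.
function latin1_to_utf8(line)
  local out = {}
  for i = 1, #line do
    out[i] = utf8_from_codepoint(line:byte(i))
  end
  return table.concat(out)
end

-- byte 233 (e acute in latin-1) becomes the two bytes 195 169
print(latin1_to_utf8("caf\233") == "caf\195\169")   -- true
```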

In a somewhat less trivial case, there is an array of 256 values.
Such an array could look like this:

   -- table values are borrowed from ConTeXt.
   encodings = {
     ["iso-8859-2"] = { [0] =
         0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
         --
         -- 240 other entries
         --
         0x0159, 0x016F, 0x00FA, 0x0171, 0x00FC, 0x00FD, 0x0163, 0x02D9
     }
   }


Having this table, we can now rewrite the 'latin_to_utf' function:

   function latin_to_utf (line,enc)
     -- look up the mapping once; nil means an unknown encoding
     local map = encodings[enc]
     local s = ""
     for c in string.bytes(line) do
       if map ~= nil then
         s = s .. unicode.utf8.char(map[c])
       else
         -- default is pass-through
         s = s .. unicode.utf8.char(c)
       end
     end
     return s
   end
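
The lookup step on its own can be tried in plain Lua as well. The sketch
below is an illustration, not the code from the attachment: it uses a
deliberately partial table with only four of the 256 iso-8859-2 entries,
returns code points instead of a UTF-8 string, and (unlike the function
above) falls back to pass-through for any byte missing from the table.
The names 'sample_encodings' and 'to_codepoints' are made up:

```lua
-- Partial, illustrative table: four iso-8859-2 entries only.
sample_encodings = {
  ["iso-8859-2"] = {
    [0xA1] = 0x0104,  -- A with ogonek
    [0xB1] = 0x0105,  -- a with ogonek
    [0xA3] = 0x0141,  -- L with stroke
    [0xB3] = 0x0142,  -- l with stroke
  }
}

-- Map every byte of 'line' to a Unicode code point.
function to_codepoints(line, enc)
  local map = sample_encodings[enc]   -- looked up once, outside the loop
  local cps = {}
  for i = 1, #line do
    local b = line:byte(i)
    -- unknown encoding, or byte missing from the partial table: pass through
    cps[i] = (map and map[b]) or b
  end
  return cps
end

local cps = to_codepoints("a\177", "iso-8859-2")
print(cps[1], cps[2])   -- 97 and 261 (0x0105)
```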

The resulting lua code is in the attached .lua file (with the
full table, of course).

For 16-bit encodings etc. the remapping is more complex, of course,
but this example should hopefully be enough to give you an idea
of how to approach it.
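
As a rough idea of what changes in the 16-bit case, here is a sketch that
decodes big-endian 16-bit units into code points. It is an illustration
only: 'ucs2be_to_codepoints' is a made-up name, surrogate pairs are not
handled, and a real reader could not rely on f:read() to split lines,
because that splits on the single byte 10:

```lua
-- Decode a string of big-endian 16-bit units into code points.
function ucs2be_to_codepoints(bytes)
  assert(#bytes % 2 == 0, "expected an even number of bytes")
  local cps = {}
  for i = 1, #bytes, 2 do
    -- two bytes form one unit: high byte first, then low byte
    local hi, lo = bytes:byte(i, i + 1)
    cps[#cps + 1] = hi * 256 + lo
  end
  return cps
end

local cps = ucs2be_to_codepoints(string.char(0, 65, 1, 5))
print(cps[1], cps[2])   -- 65 and 261 (0x0105)
```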

There is one big caveat I should warn about: because the current LuaTeX
is essentially a merge of Aleph and pdfTeX, you almost certainly need
an OTP to convert the resulting Unicode values back to font encodings.
That problem is why the font and hyphenation subsystems need to be
tackled next, before anything else, and that is what I'll start on
next Monday.

Best,

Taco
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fenc-example.lua
Type: application/lua
Size: 3330 bytes
Desc: not available
Url : http://www.ntg.nl/mailman/private/dev-luatex/attachments/20061209/a7ba43cb/attachment.bin 

