Re: [Dev-luatex] Filename encoding

9 Dec 2013


      2013/12/9 Javier Múgica de Rivera 
...
Hi,
I have already solved the command line issue, understanding that it is
inappropriate for LuaTeX to try to guess anything in this respect. But
now I have run into another problem, which has nothing to do with the
command line.
If, within a utf-8 file, say test.tex, I write
\input Canción.tex
I get this output:
...
luaplainJA asd.tex
This is LuaTeX, Version beta-0.76.0-2013052306 (rev 4627)
(./test.tex
! I can't find file `Canci├│n.tex'.
l.1 \input{Canci├│n.tex}
Please type another input file name:
LuaTeX finds the file if I name it CanciÃ³n.tex, the 8-char representation
of Canción.tex. I wondered what would happen if I wrote a
filename using characters not from my OS default locale, so I created
the file Ωδη.tex and within test.tex I wrote
\input Ωδη.tex
But the problem persists. The file is found if I name it Î©Î´Î·.tex,
the 8-char representation of Ωδη.tex.
This is different from LuaTeX assuming that all its input is UTF-8:
it is assuming that the underlying OS uses UTF-8 for
filenames.
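The two names reported above are consistent with the UTF-8 bytes of the name being read one byte per character in a Latin-1 style code page. That code page is an assumption here, not something stated in the report. A minimal sketch that reproduces the effect, with bytes_as_latin1_to_utf8 being a hypothetical helper written only for this illustration:

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): treat every byte of `in` as one
   Latin-1 character and re-encode that character as UTF-8.
   `out` must have room for 2 * strlen(in) + 1 bytes. */
void bytes_as_latin1_to_utf8(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *) in;
    unsigned char *q = (unsigned char *) out;

    for (; *p != 0; p++) {
        if (*p < 0x80) {
            /* ASCII bytes are unchanged */
            *q++ = *p;
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8 */
            *q++ = (unsigned char) (0xC0 | (*p >> 6));
            *q++ = (unsigned char) (0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
}
```

Feeding it the UTF-8 bytes of Canción.tex (where ó is 0xC3 0xB3) gives the UTF-8 encoding of CanciÃ³n.tex, i.e. the name under which the file is actually found.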
I searched the code and found the relevant function, the one that
parses the argument of \input. It is in filename.w. The doc says:
@ In order to isolate the system-dependent aspects of file names, the
@^system dependencies@>
system-independent parts of \TeX\ are expressed in terms
of three system-dependent procedures called |begin_name|,
|more_name|, and |end_name|. In essence, if the user-specified
characters of the file name are $c_1\ldots c_n$, the
system-independent driver program does the operations
$$|begin_name|;\,|more_name|(c_1);\,\ldots\,;\,|more_name|(c_n);
\,|end_name|.$$
The function scan_file_name includes
void scan_file_name(void){
[...]
    if (cur_chr > 127) {
        unsigned char *bytes;
        unsigned char *thebytes;
        thebytes = uni2str((unsigned) cur_chr);
        bytes = thebytes;
        while (*bytes) {
            if (!more_name(*bytes))
                break;
            bytes++;
        }
        xfree(thebytes);
    }
[...]
}
static boolean more_name(ASCII_code c)
{
[...]
    append_char(c);         /* contribute |c| to the current string */
[...]
}
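To make the byte splitting in the excerpt concrete, here is a sketch of how one code point becomes a sequence of UTF-8 bytes, each of which would then be passed to more_name. The function encode_utf8 is a hypothetical stand-in written for this illustration; it is not LuaTeX's uni2str, whose exact behaviour is not shown above.

```c
#include <stddef.h>

/* Hypothetical stand-in (not LuaTeX's uni2str): write the UTF-8 encoding of
   code point `cp` into `buf`, NUL-terminate it, and return the number of
   encoded bytes. `buf` needs room for 5 bytes. Returns 0 for surrogates and
   for values above U+10FFFF. */
size_t encode_utf8(unsigned long cp, unsigned char *buf)
{
    size_t n = 0;

    if (cp < 0x80) {
        buf[n++] = (unsigned char) cp;
    } else if (cp < 0x800) {
        buf[n++] = (unsigned char) (0xC0 | (cp >> 6));
        buf[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        /* surrogate code points have no UTF-8 encoding */
    } else if (cp < 0x10000) {
        buf[n++] = (unsigned char) (0xE0 | (cp >> 12));
        buf[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {
        buf[n++] = (unsigned char) (0xF0 | (cp >> 18));
        buf[n++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    }
    buf[n] = '\0';
    return n;
}
```

For ó (U+00F3) this yields the two bytes 0xC3 0xB3, and for Ω (U+03A9) the two bytes 0xCE 0xA9, so the string built by more_name holds the UTF-8 form of the filename.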
The C standard does not have a wide-character equivalent to fopen, but
I suppose all current compilers have it. Visual Studio's is _wfopen
(arguments to _wfopen are wide-character strings. _wfopen and fopen
behave identically otherwise). Wouldn't it be easier to use that
function and avoid breaking each character into its utf-8 8-char
representation, presuming fopen/OS will understand it properly?
It's not that easy. For Windows, you need to convert the code points to
UTF-16 and then use _wfopen. For OS X and Linux, you need to convert it to
UTF-8 and then call fopen. In such cases it's often easier to only store
one version internally (e.g. the UTF-8 version) and then convert to the
system encoding at the very edge of the program, i.e., replace all calls to
fopen by a wrapper function that fans out to fopen or _wfopen depending on
the operating system. I tried this once with LuaTeX, but never finished
because I had seriously underestimated the amount of work required. fopen is
called from dozens of places, and there are other filesystem functions to
take care of. In essence you need to replace every call to any filesystem
function. There are some drop-in wrappers available, e.g. GLib (
https://developer.gnome.org/glib/2.38/glib-File-Utilities.html#g-fopen).
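A rough sketch of such a wrapper, assuming filenames are kept internally as UTF-8. The name fopen_utf8 is hypothetical and nothing like it is claimed to exist in LuaTeX. On Windows it converts to UTF-16 with MultiByteToWideChar and calls _wfopen; elsewhere it passes the UTF-8 bytes straight to fopen.

```c
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical wrapper: `name` and `mode` are UTF-8 strings.
   Returns NULL on failure, like fopen. */
FILE *fopen_utf8(const char *name, const char *mode)
{
#ifdef _WIN32
    FILE *f = NULL;
    /* First pass: required lengths in wide chars, including the NUL */
    int nlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, name, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, NULL, 0);

    if (nlen > 0 && mlen > 0) {
        wchar_t *wname = malloc((size_t) nlen * sizeof *wname);
        wchar_t *wmode = malloc((size_t) mlen * sizeof *wmode);

        /* Second pass: do the UTF-8 to UTF-16 conversion, then open */
        if (wname != NULL && wmode != NULL
            && MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, name, -1, wname, nlen) > 0
            && MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, wmode, mlen) > 0)
            f = _wfopen(wname, wmode);
        free(wname);
        free(wmode);
    }
    return f;
#else
    /* OS X and Linux: the UTF-8 bytes are handed to fopen unchanged */
    return fopen(name, mode);
#endif
}
```

This only covers fopen. As said above, every other filesystem call (stat, remove, rename, directory listing and so on) would need the same treatment, which is where most of the work lies.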

Philipp Stephani