Philipp Stephani
Mon Dec 9 20:15:35 CET 2013

2013/12/9 Javier Múgica de Rivera <javieraritz.ribadeo at gmail.com>

> Hi,
> I have already solved the command line issue, understanding that it is
> inappropriate for Luatex to try to guess anything in this respect. But
> now I have run into another problem, which has nothing to do with the
> command line.
> If, within a utf-8 file, say test.tex, I write
> \input Canción.tex
> I get this output:
> >luaplainJA asd.tex
> This is LuaTeX, Version beta-0.76.0-2013052306 (rev 4627)
> (./test.tex
> ! I can't find file `Canci├│n.tex'.
> l.1 \input{Canci├│n.tex}
> Please type another input file name:
> Luatex finds the file if I name it the 8-char representation of
> Canción.tex: Canción.tex. I wondered what would happen if I write a
> filename using characters not from my OS default locale, so I created
> the file Ωδη.tex and within test.tex I wrote
> \input Ωδη.tex
> But the problem persists. The file is found if I name it Ωδη.tex,
> the 8-char representation of Ωδη.tex.
> This is something different from luatex assuming all its input is
> utf-8. This is assuming that the underlying OS uses utf-8 for
> filenames.
> I searched into the code and found the relevant function, the one that
> parses the argument of \input. It is in filename.w. The doc says:
> @  In order to isolate the system-dependent aspects of file names, the
>   @^system dependencies@> system-independent parts of \TeX\ are
>   expressed in terms of three system-dependent procedures called
>   |begin_name|, |more_name|, and |end_name|. In essence, if the
>   user-specified
> characters of the file name are $c_1\ldots c_n$, the
> system-independent driver program does the operations
> $$|begin_name|;\,|more_name|(c_1);\,\ldots\,;\,|more_name|(c_n);
> \,|end_name|.$$
> The function scan_file_name includes
> void scan_file_name(void){
> [...]
>       if (cur_chr > 127) {
>             unsigned char *bytes;
>             unsigned char *thebytes;
>             thebytes = uni2str((unsigned) cur_chr);
>             bytes = thebytes;
>             while (*bytes) {
>                 if (!more_name(*bytes))
>                     break;
>                 bytes++;
>             }
>             xfree(thebytes);
>         }
> [...]
> }
> static boolean more_name(ASCII_code c)
> {
> [...]
> append_char(c);         /* contribute |c| to the current string */
> [...]
> }
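For reference, what the quoted loop feeds to more_name can be sketched
as follows. This is only an illustration: encode_utf8 is a hypothetical
helper written for this message, not LuaTeX's uni2str, and its name and
signature are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper (not LuaTeX's uni2str): encode one Unicode code
   point as UTF-8 into out, which must hold at least 5 bytes.  The result
   is NUL-terminated.  Returns the number of bytes written, or 0 if cp is
   a surrogate or lies beyond U+10FFFF. */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    size_t n;

    if (cp < 0x80UL) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char) cp;
        n = 1;
    } else if (cp < 0x800UL) {              /* 2 bytes */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000UL) {            /* 3 bytes */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFFUL) {          /* 4 bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    out[n] = '\0';
    return n;
}
```

For U+00F3 (the accented o in the file name) this yields the two bytes
0xC3 0xB3. That is consistent with the two box-drawing characters in the
error message above, assuming the console uses a code page such as 850,
where those byte values are box-drawing glyphs.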
> The C standard does not have a wide-character equivalent of fopen, but
> I suppose all current compilers have one. Visual Studio's is _wfopen
> (arguments to _wfopen are wide-character strings; _wfopen and fopen
> behave identically otherwise). Wouldn't it be easier to use that
> function and avoid breaking each character into its UTF-8 8-char
> representation, presuming fopen/OS will understand it properly?
It's not that easy. For Windows, you need to convert the code points to
UTF-16 and then call _wfopen. For OS X and Linux, you need to convert them
to UTF-8 and then call fopen. In such cases it's often easier to store only
one version internally (e.g. the UTF-8 version) and convert to the system
encoding at the very edge of the program, i.e., replace all calls to fopen
with a wrapper function that dispatches to fopen or _wfopen depending on
the operating system. I tried this once with LuaTeX, but never finished
because I badly underestimated the amount of work required. fopen is
called from dozens of places, and there are other filesystem functions to
take care of. In essence you need to replace each call to any filesystem
function. There are some drop-in wrappers available, e.g. GLib (

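A minimal sketch of such a wrapper, assuming file names are kept as UTF-8
internally. The names utf8_fopen and utf8_to_utf16 are hypothetical and
do not exist in LuaTeX. The Windows branch relies on the Win32 function
MultiByteToWideChar and on _wfopen; everywhere else the UTF-8 bytes are
passed straight to fopen.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>            /* MultiByteToWideChar, CP_UTF8 */
#include <wchar.h>              /* _wfopen */

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into a
   freshly allocated UTF-16 string.  Returns NULL on invalid input or
   allocation failure; the caller frees the result. */
static wchar_t *utf8_to_utf16(const char *s)
{
    wchar_t *w;
    /* First call: ask for the required length, including the NUL. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);

    if (n <= 0)
        return NULL;
    w = malloc((size_t) n * sizeof *w);
    if (w != NULL
        && MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: name and mode are always UTF-8.  The conversion
   to the system encoding happens only here, at the edge of the program. */
FILE *utf8_fopen(const char *name, const char *mode)
{
#ifdef _WIN32
    wchar_t *wname = utf8_to_utf16(name);
    wchar_t *wmode = utf8_to_utf16(mode);
    FILE *f = NULL;

    if (wname != NULL && wmode != NULL)
        f = _wfopen(wname, wmode);
    free(wname);                /* free(NULL) is a no-op */
    free(wmode);
    return f;
#else
    /* Assumes the filesystem takes UTF-8 byte strings as they are. */
    return fopen(name, mode);
#endif
}
```

Every existing fopen call would then have to become a utf8_fopen call,
and the same treatment is needed for each of the other filesystem
functions, which is where the amount of work comes from.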