[Dev-luatex] Filename encoding

Javier Múgica de Rivera javieraritz.ribadeo at gmail.com
Mon Dec 9 11:34:48 CET 2013


Hi,

I have already solved the command line issue, and I understand that it
is inappropriate for LuaTeX to try to guess anything in this respect.
But now I have run into another problem, which has nothing to do with
the command line.

If, within a utf-8 file, say test.tex, I write

\input Canción.tex

I get this output:

>luaplainJA asd.tex
This is LuaTeX, Version beta-0.76.0-2013052306 (rev 4627)
(./test.tex
! I can't find file `Canci├│n.tex'.
l.1 \input{Canci├│n.tex}

Please type another input file name:


LuaTeX finds the file if I name it with the 8-bit representation of
Canción.tex, that is, with each byte of the UTF-8 encoding read as one
character of the 8-bit code page: CanciÃ³n.tex. I wondered what would
happen if I wrote a filename using characters that are not in my OS
default locale, so I created the file Ωδη.tex and within test.tex I wrote

\input Ωδη.tex

But the problem persists. The file is found if I name it Î©Î´Î·.tex,
the 8-bit representation of Ωδη.tex.
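
To make the "8-bit representation" concrete, here is a minimal sketch (not
taken from the LuaTeX sources) that re-reads each byte of a UTF-8 string as
one character and prints the result back as UTF-8. The function name
misread_as_latin1 is hypothetical, and the code page is assumed to be
Latin-1, which agrees with Windows-1252 for the bytes 0xA0 to 0xFF that
occur in the two file names above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: treat every byte of a UTF-8 string as one Latin-1
 * character and write those characters to |out|, encoded as UTF-8.
 * Assumption: the 8-bit code page is Latin-1 (ISO 8859-1).
 * Returns 0 on success, -1 if |out| is too small. */
int misread_as_latin1(const char *utf8, char *out, size_t out_size)
{
    const unsigned char *p;
    size_t n = 0;

    assert(utf8 != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (p = (const unsigned char *) utf8; *p; p++) {
        if (*p < 0x80) {
            /* ASCII bytes mean the same thing in both encodings */
            if (n + 1 >= out_size)
                return -1;
            out[n++] = (char) *p;
        } else {
            /* byte b is read as U+00b, which needs two bytes in UTF-8 */
            if (n + 2 >= out_size)
                return -1;
            out[n++] = (char) (0xC0 | (*p >> 6));
            out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return 0;
}
```

Calling it on the UTF-8 bytes of Canción.tex produces CanciÃ³n.tex, and on
those of Ωδη.tex it produces Î©Î´Î·.tex.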

This is different from LuaTeX assuming that all its input is UTF-8. It
is assuming that the underlying OS uses UTF-8 for filenames.
I searched the code and found the relevant function, the one that
parses the argument of \input. It is in filename.w. The doc says:

@  In order to isolate the system-dependent aspects of file names, the
  @^system dependencies@> system-independent parts of \TeX\ are expressed
  in terms of three system-dependent procedures called |begin_name|,
  |more_name|, and |end_name|. In essence, if the user-specified
  characters of the file name are $c_1\ldots c_n$, the system-independent
  driver program does the operations

$$|begin_name|;\,|more_name|(c_1);\,\ldots\,;\,|more_name|(c_n); \,|end_name|.$$

The function scan_file_name includes

void scan_file_name(void)
{
[...]
        if (cur_chr > 127) {
            unsigned char *bytes;
            unsigned char *thebytes;
            thebytes = uni2str((unsigned) cur_chr);
            bytes = thebytes;
            while (*bytes) {
                if (!more_name(*bytes))
                    break;
                bytes++;
            }
            xfree(thebytes);
        }
[...]
}

static boolean more_name(ASCII_code c)
{
[...]
    append_char(c);         /* contribute |c| to the current string */
[...]
}
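
As a rough illustration of what the excerpt above amounts to, the following
stand-alone sketch encodes one character as UTF-8 and appends the bytes one
at a time to a name buffer. It assumes that uni2str returns the UTF-8
encoding of the code point, which is what the observed behaviour suggests.
The names encode_utf8, append_name_byte, append_name_char, name_buf and
name_len are hypothetical stand-ins, not LuaTeX functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the string that |more_name| builds up. */
unsigned char name_buf[256];
size_t name_len = 0;

/* Hypothetical stand-in for |more_name|: append one byte to the name.
 * Returns 1 on success, 0 when the buffer is full. */
int append_name_byte(unsigned char c)
{
    if (name_len + 1 >= sizeof name_buf)
        return 0;
    name_buf[name_len++] = c;
    name_buf[name_len] = '\0';
    return 1;
}

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if |cp| is not a
 * valid scalar value (a surrogate, or above U+10FFFF). */
size_t encode_utf8(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Append every UTF-8 byte of one character to the name, mirroring the
 * |while (*bytes)| loop in the excerpt. Returns 1 on success, 0 on error. */
int append_name_char(unsigned long cp)
{
    unsigned char bytes[4];
    size_t i, n = encode_utf8(cp, bytes);

    assert(name_len < sizeof name_buf);
    if (n == 0)
        return 0;
    for (i = 0; i < n; i++)
        if (!append_name_byte(bytes[i]))
            return 0;
    return 1;
}
```

With this scheme the ó of Canción.tex (U+00F3) reaches the name buffer as the
two bytes 0xC3 0xB3, so whatever later opens the file receives a plain byte
string and has to decide on its own how to interpret it.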

The C standard does not have a wide-character equivalent of fopen, but
I suppose all current compilers provide one. Visual Studio's is _wfopen
(its arguments are wide-character strings; otherwise _wfopen and fopen
behave identically). Wouldn't it be easier to use that function, rather
than breaking each character into its UTF-8 bytes and presuming that
fopen and the OS will interpret them properly?
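
A minimal sketch of the idea, under stated assumptions: the helper names
utf8_to_utf16 and fopen_utf8 are hypothetical, the decoder is hand-written
so that it can be tried on any platform, it does not reject overlong
encodings, and the 260-unit buffer is an arbitrary limit chosen for the
example. Only the _WIN32 branch uses _wfopen; elsewhere the UTF-8 bytes are
passed to fopen unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#ifdef _WIN32
#include <wchar.h>
#endif

/* Hypothetical helper: decode a UTF-8 string into UTF-16 code units.
 * Returns the number of units written (excluding the terminating 0),
 * or (size_t)-1 on malformed input or if |out| is too small. */
size_t utf8_to_utf16(const char *utf8, uint16_t *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *) utf8;
    size_t n = 0;

    assert(utf8 != NULL && out != NULL);
    while (*p) {
        unsigned long cp;
        int extra;                  /* continuation bytes still expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t) -1;    /* stray continuation or invalid byte */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return (size_t) -1; /* truncated or malformed sequence */
            cp = (cp << 6) | (unsigned long) (*p++ & 0x3F);
        }
        if (cp > 0x10FFFF)
            return (size_t) -1;
        if (cp >= 0x10000) {        /* needs a surrogate pair */
            if (n + 2 >= out_size)
                return (size_t) -1;
            cp -= 0x10000;
            out[n++] = (uint16_t) (0xD800 | (cp >> 10));
            out[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= out_size)
                return (size_t) -1;
            out[n++] = (uint16_t) cp;
        }
    }
    if (n >= out_size)
        return (size_t) -1;
    out[n] = 0;
    return n;
}

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
FILE *fopen_utf8(const char *utf8_name, const char *mode)
{
#ifdef _WIN32
    uint16_t units[260];
    wchar_t wname[260], wmode[16];
    size_t i, n = utf8_to_utf16(utf8_name, units, 260);

    if (n == (size_t) -1)
        return NULL;
    for (i = 0; i <= n; i++)        /* copy including the terminator */
        wname[i] = (wchar_t) units[i];
    for (i = 0; mode[i] && i < 15; i++)
        wmode[i] = (wchar_t) (unsigned char) mode[i];
    wmode[i] = L'\0';
    return _wfopen(wname, wmode);
#else
    return fopen(utf8_name, mode);  /* assumes the OS takes UTF-8 names */
#endif
}
```

On Windows the conversion could presumably be left to the system's own
MultiByteToWideChar with CP_UTF8 instead of a hand-written decoder; the
point of the sketch is only that the name built by scan_file_name can be
turned back into wide characters before the file is opened.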

Regards,
Javier

P.S.: I was told in a previous reply that I can create files with
UTF-8 filenames on Windows. How can that be achieved?

