2013/12/9 Javier Múgica de Rivera
Hi,
I have already solved the command line issue, understanding that it is inappropriate for LuaTeX to try to guess anything in this respect. But now I have run into another problem, which has nothing to do with the command line.
If, within a utf-8 file, say test.tex, I write
\input Canción.tex
I get this output:
luaplainJA asd.tex
This is LuaTeX, Version beta-0.76.0-2013052306 (rev 4627)
(./test.tex
! I can't find file `Canci├│n.tex'.
l.1 \input{Canci├│n.tex}
Please type another input file name:
LuaTeX finds the file if I name it after the 8-bit-character representation of Canción.tex, that is, its UTF-8 bytes each read as a separate character: CanciÃ³n.tex. I wondered what would happen if I wrote a filename using characters that are not in my OS default locale, so I created the file Ωδη.tex and within test.tex I wrote
\input Ωδη.tex
But the problem persists. The file is found if I name it Î©Î´Î·.tex, the 8-bit-character representation of the UTF-8 bytes of Ωδη.tex.
This is a different matter from LuaTeX assuming that all its input is UTF-8: here it is assuming that the underlying OS uses UTF-8 for filenames. I searched the code and found the relevant function, the one that parses the argument of \input. It is in filename.w. The documentation there says:
@ In order to isolate the system-dependent aspects of file names, the @^system dependencies@> system-independent parts of \TeX\ are expressed in terms of three system-dependent procedures called |begin_name|, |more_name|, and |end_name|. In essence, if the user-specified characters of the file name are $c_1\ldots c_n$, the system-independent driver program does the operations
$$|begin_name|;\,|more_name|(c_1);\,\ldots\,;\,|more_name|(c_n); \,|end_name|.$$
The function scan_file_name includes
    void scan_file_name(void)
    {
        [...]
        if (cur_chr > 127) {
            unsigned char *bytes;
            unsigned char *thebytes;
            thebytes = uni2str((unsigned) cur_chr);
            bytes = thebytes;
            while (*bytes) {
                if (!more_name(*bytes))
                    break;
                bytes++;
            }
            xfree(thebytes);
        }
        [...]
    }

    static boolean more_name(ASCII_code c)
    {
        [...]
        append_char(c);  /* contribute |c| to the current string */
        [...]
    }
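To make the byte splitting concrete, here is a minimal sketch of what a helper in the spirit of uni2str presumably does: it turns one code point into its UTF-8 bytes, which scan_file_name then feeds to more_name one at a time. The name encode_utf8 and its signature are hypothetical; this is not LuaTeX's actual code.

```c
#include <stddef.h>

/* Hypothetical helper (NOT LuaTeX's uni2str): encode one Unicode code point
   as UTF-8 into out[], which must have room for 5 bytes (4 + terminator).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t encode_utf8(unsigned long cp, unsigned char out[5])
{
    size_t n = 0;

    if (cp < 0x80) {                              /* 1 byte: plain ASCII */
        out[n++] = (unsigned char) cp;
    } else if (cp < 0x800) {                      /* 2 bytes */
        out[n++] = (unsigned char) (0xC0 | (cp >> 6));
        out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) {       /* surrogates are invalid */
            out[0] = '\0';
            return 0;
        }
        out[n++] = (unsigned char) (0xE0 | (cp >> 12));
        out[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[n++] = (unsigned char) (0xF0 | (cp >> 18));
        out[n++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[n++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[n++] = (unsigned char) (0x80 | (cp & 0x3F));
    }
    out[n] = '\0';                                /* n stays 0 if cp is out of range */
    return n;
}
```

For ó (U+00F3) this gives the two bytes 0xC3 0xB3. A console using code page 850 or 437 shows those bytes as ├│, which matches the error message above, and a file system that reads them as Windows-1252 sees the two characters Ã³.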
The C standard does not have a wide-character equivalent of fopen, but I suppose all current compilers have one. Visual Studio's is _wfopen (arguments to _wfopen are wide-character strings; _wfopen and fopen behave identically otherwise). Wouldn't it be easier to use that function and avoid breaking each character into its UTF-8 byte representation, presuming that fopen/the OS will understand it properly?
It's not that easy. For Windows, you need to convert the code points to UTF-16 and then use _wfopen. For OS X and Linux, you need to convert them to UTF-8 and then call fopen. In such cases it's often easier to store only one version internally (e.g. the UTF-8 version) and then convert to the system encoding at the very edge of the program, i.e., replace all calls to fopen by a wrapper function that fans out to fopen or _wfopen depending on the operating system. I tried this once with LuaTeX, but never finished because I really underestimated the amount of work required. fopen is called from dozens of places, and there are other filesystem functions to take care of. In essence you need to replace each call to any filesystem function. There are some drop-in wrappers available, e.g. GLib ( https://developer.gnome.org/glib/2.38/glib-File-Utilities.html#g-fopen).
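A rough sketch of what such a wrapper could look like, assuming that filenames are kept as UTF-8 internally. The names utf8_fopen and utf8_to_wide are made up for illustration and nothing like this exists in LuaTeX. The Windows branch uses the Win32 call MultiByteToWideChar and the C runtime's _wfopen; everywhere else the UTF-8 bytes are passed to fopen unchanged.

```c
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into a freshly
   allocated UTF-16 string. Returns NULL on invalid UTF-8 or if malloc fails. */
static wchar_t *utf8_to_wide(const char *s)
{
    wchar_t *w;
    /* First call: ask how many wide characters are needed (including NUL). */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);

    if (n <= 0)
        return NULL;
    w = malloc((size_t) n * sizeof *w);
    /* Second call: do the actual conversion into the buffer. */
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: path and mode are UTF-8 on every platform. */
FILE *utf8_fopen(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *f = NULL;

    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);   /* wide-character variant of fopen */
    free(wpath);
    free(wmode);
    return f;
#else
    /* OS X and Linux: the UTF-8 bytes go straight to fopen. */
    return fopen(path, mode);
#endif
}
```

This only covers fopen. The same treatment would be needed for every other filesystem call (stat, remove, rename, directory listing and so on), which is where most of the work lies.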