Hi,

I have already solved the command line issue, understanding that it is inappropriate for LuaTeX to try to guess anything in this respect. But now I have run into another problem, which has nothing to do with the command line. If, within a UTF-8 file, say test.tex, I write

    \input Canción.tex

I get this output:
    luaplainJA asd.tex
    This is LuaTeX, Version beta-0.76.0-2013052306 (rev 4627)
    (./test.tex
    ! I can't find file `Canci├│n.tex'.
    l.1 \input{Canci├│n.tex}
    Please type another input file name:

Luatex finds the file if I name it with the 8-bit representation of Canción.tex: CanciÃ³n.tex. I wondered what would happen if I write a filename using characters not from my OS default locale, so I created the file Ωδη.tex and within test.tex I wrote

    \input Ωδη.tex

But the problem persists. The file is found if I name it Î©Î´Î·.tex, the 8-bit representation of Ωδη.tex.

This is something different from luatex assuming all its input is UTF-8. This is assuming that the underlying OS uses UTF-8 for filenames. I searched the code and found the relevant function, the one that parses the argument of \input. It is in filename.w. The doc says:

    @ In order to isolate the system-dependent aspects of file names, the
    @^system dependencies@>
    system-independent parts of \TeX\ are expressed in terms of three
    system-dependent procedures called |begin_name|, |more_name|, and
    |end_name|. In essence, if the user-specified characters of the file
    name are $c_1\ldots c_n$, the system-independent driver program does
    the operations
    $$|begin_name|;\,|more_name|(c_1);\,\ldots\,;\,|more_name|(c_n);\,|end_name|.$$

The function scan_file_name includes

    void scan_file_name(void)
    {
        [...]
        if (cur_chr > 127) {
            unsigned char *bytes;
            unsigned char *thebytes;
            /* encode the code point as a UTF-8 byte string */
            thebytes = uni2str((unsigned) cur_chr);
            bytes = thebytes;
            /* feed the name-building machinery one byte at a time */
            while (*bytes) {
                if (!more_name(*bytes))
                    break;
                bytes++;
            }
            xfree(thebytes);
        }
        [...]
    }

    static boolean more_name(ASCII_code c)
    {
        [...]
        append_char(c); /* contribute |c| to the current string */
        [...]
    }

The C standard does not have a wide-character equivalent to fopen, but I suppose all current compilers have one. Visual Studio's is _wfopen (arguments to _wfopen are wide-character strings; _wfopen and fopen behave identically otherwise). Wouldn't it be easier to use that function and avoid breaking each character into its UTF-8 byte representation, presuming fopen/the OS will understand it properly?

Regards,

Javier

P.S.: I have been told in a previous reply that I can create files with UTF-8 filenames on Windows. How can that be achieved?
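A sketch of the byte-splitting step the quoted code performs may help; utf8_encode below is a hypothetical stand-in for LuaTeX's uni2str, whose exact interface may differ:

    /* Encode one code point as a NUL-terminated UTF-8 byte string,
       which more_name then consumes byte by byte. */
    static void utf8_encode(unsigned int cp, unsigned char buf[5])
    {
        unsigned char *p = buf;
        if (cp < 0x80) {
            *p++ = (unsigned char) cp;
        } else if (cp < 0x800) {
            *p++ = (unsigned char) (0xC0 | (cp >> 6));
            *p++ = (unsigned char) (0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *p++ = (unsigned char) (0xE0 | (cp >> 12));
            *p++ = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char) (0x80 | (cp & 0x3F));
        } else {
            *p++ = (unsigned char) (0xF0 | (cp >> 18));
            *p++ = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            *p++ = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char) (0x80 | (cp & 0x3F));
        }
        *p = '\0';
    }

For ó (U+00F3) this yields the bytes C3 B3, which is exactly the ├│ shown by the CP850 console above.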
On 12/9/2013 11:34 AM, Javier Múgica de Rivera wrote:
    l.1 \input{Canci├│n.tex}
               ^^
    Luatex finds the file if I name it with the 8-bit representation of Canción.tex: CanciÃ³n.tex. I wondered what would happen if I write a
               ^^
You really have to make sure that you don't let the OS apply a locale to the name. So if e.g. you copy the string Canción.tex to the clipboard, it can go from Unicode to some code page and back. If you create files from (say) Lua all goes OK, as then no translation takes place. (Maybe it helps to set the code page to 65001.)

Hans
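A concrete way to try the code page suggestion, assuming a plain cmd.exe console (65001 is the Windows code page number for UTF-8):

    chcp 65001
    luaplainJA test.tex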
2013/12/9, Hans Hagen
You really have to make sure that you don't let the OS apply a locale to the name. So if e.g. you copy the string Canción.tex to the clipboard, it can go from Unicode to some code page and back.
I noticed that pasting the contents of the clipboard may change the encoding. That is why I created the file Ωδη.tex, which must be stored in some Unicode encoding. Given that it is not found by luatex, it is not UTF-8. I suppose it is pure 16-bit Unicode, as Mindaugas pointed out in some previous reply.
If you create files from (say) Lua all goes OK, as then no translation takes place.
I had forgotten files created by Lua... It still seems to me that the more robust approach is to use UTF-16 for filenames when interacting with the system, both when creating them and when opening them for reading. The only function involved is fopen, to be replaced by _wfopen or a similar compiler-dependent name.
    l.1 \input{Canci├│n.tex}
               ^^
    Luatex finds the file if I name it with the 8-bit representation of Canción.tex: CanciÃ³n.tex. I wondered what would happen if I write a
               ^^
Sorry, I had missed your ^^'s. The terminal output, Canci├│n, is irrelevant; the fact is that luatex hasn't found the file. The log file displays the name properly.
2013/12/9 Javier Múgica de Rivera
The C standard does not have a wide-character equivalent to fopen, but I suppose all current compilers have one. Visual Studio's is _wfopen (arguments to _wfopen are wide-character strings; _wfopen and fopen behave identically otherwise). Wouldn't it be easier to use that function and avoid breaking each character into its UTF-8 byte representation, presuming fopen/the OS will understand it properly?
It's not that easy. For Windows, you need to convert the code points to UTF-16 and then use _wfopen. For OS X and Linux, you need to convert it to UTF-8 and then call fopen. In such cases it's often easier to only store one version internally (e.g. the UTF-8 version) and then convert to the system encoding at the very edge of the program, i.e., replace all calls to fopen by a wrapper function that fans out to fopen or _wfopen depending on the operating system. I tried this once with LuaTeX, but never finished because I really underestimated the amount of work required. fopen is called from dozens of places, and there are other filesystem functions to take care of. In essence you need to replace each call to any filesystem function. There are some drop-in wrappers available, e.g. GLib (https://developer.gnome.org/glib/2.38/glib-File-Utilities.html#g-fopen).
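As a sketch of that wrapper idea, assuming file names are kept as UTF-8 internally (the name xfopen8 and the fixed-size buffers are illustrative, not LuaTeX code):

    #include <stdio.h>

    #ifdef _WIN32
    #include <windows.h>

    FILE *xfopen8(const char *name_utf8, const char *mode)
    {
        wchar_t wname[MAX_PATH];
        wchar_t wmode[8];
        /* CP_UTF8 tells the API that the input bytes are UTF-8;
           -1 means "convert up to and including the terminating NUL". */
        if (!MultiByteToWideChar(CP_UTF8, 0, name_utf8, -1, wname, MAX_PATH))
            return NULL;
        if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8))
            return NULL;
        return _wfopen(wname, wmode);
    }
    #else
    FILE *xfopen8(const char *name_utf8, const char *mode)
    {
        /* On Unix the kernel takes the bytes as-is; a UTF-8 locale
           merely makes them display correctly. */
        return fopen(name_utf8, mode);
    }
    #endif

A real implementation would query MultiByteToWideChar for the required buffer length instead of using MAX_PATH, but the shape of the solution is the same.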
2013/12/9, Philipp Stephani
It's not that easy. For Windows, you need to convert the code points to UTF-16...
or pass it without conversion. Characters beyond the Basic Multilingual Plane need not be allowed in filenames.
and then use _wfopen. For OS X and Linux, you need to convert it to UTF-8 and then call fopen. In such cases it's often easier to only store one version internally (e.g. the UTF-8 version)
or just the string of code points as it is stored internally by luatex (I think it is a string of ints or unsigned integers; I can't remember now).
and then convert to the system encoding at the very edge of the program, i.e., replace all calls to fopen by a wrapper function that fans out to fopen or _wfopen depending on the operating system. I tried this once with LuaTeX, but never finished because I really underestimated the amount of work required. fopen is called from dozens of places, and there are other filesystem functions to take care of. In essence you need to replace each call to any filesystem function. There are some drop-in wrappers available, e.g. GLib (https://developer.gnome.org/glib/2.38/glib-File-Utilities.html#g-fopen).
I thought, as you once did, that the amount of work required was small. In any case, this is something that ought to have been programmed from the outset but has been left undone till now. To call it by its name, this is a bug. Writing \input whateveráéè.tex and luatex not finding the file is a bug. As a Spanish speaker this is not a serious issue for me, but I wonder how people using different scripts, e.g. Greek, Russian, Hebrew, etc., on Windows manage to get around this problem. Is it that they just don't \input files?
On Thursday, 12 December 2013 11:44:52, Javier Múgica de Rivera <javieraritz.ribadeo@gmail.com> wrote:

2013/12/9, Philipp Stephani
It's not that easy. For Windows, you need to convert the code points to UTF-16...
or pass it without conversion. Characters beyond the Basic Multilingual Plane need not be allowed in filenames.

I think they should be allowed; they are normal Unicode characters, and the BMP isn't special.
and then use _wfopen. For OS X and Linux, you need to convert it to UTF-8 and then call fopen. In such cases it's often easier to only store one version internally (e.g. the UTF-8 version)
or just the string of code points as it is stored internally by luatex (I think it is a string of ints or unsigned integers; I can't remember now).

Not sure whether LuaTeX stores file names as a code point array. This would, however, preclude byte strings that are not valid Unicode strings (but are legal on Unix).
and then convert to the system encoding at the very edge of the program, i.e., replace all calls to fopen by a wrapper function that fans out to fopen or _wfopen depending on the operating system. I tried this once with LuaTeX, but never finished because I really underestimated the amount of work required. fopen is called from dozens of places, and there are other filesystem functions to take care of. In essence you need to replace each call to any filesystem function. There are some drop-in wrappers available, e.g. GLib (https://developer.gnome.org/glib/2.38/glib-File-Utilities.html#g-fopen).
I thought, as you once did, that the amount of work required was small. In any case, this is something that ought to have been programmed from the outset but has been left undone till now. To call it by its name, this is a bug. Writing \input whateveráéè.tex and luatex not finding the file is a bug. As a Spanish speaker this is not a serious issue for me, but I wonder how people using different scripts, e.g. Greek, Russian, Hebrew, etc., on Windows manage to get around this problem. Is it that they just don't \input files?

I totally agree that this is a bug. I think not supporting Unicode when the underlying system supports Unicode should always be treated as a bug. Conventional wisdom says to use only ASCII characters in file names to be portable, but honestly, we're not living in the 1960s any more. Here is an old thread about the same problem: http://tug.org/pipermail/tex-live/2011-May/029059.html
On 12/29/2013 10:10 AM, Khaled Hosny wrote:
On Sun, Dec 29, 2013 at 01:07:15AM +0000, Philipp Maximilian Stephani wrote:
but honestly, we're not living in the 1960s any more.
No, we are not, but Windows is.
I always wonder why folks need comments like this. I can come up with Linux aspects that are 1960s too. I more and more tend to ignore discussions (and mails) that have this OS-bad-this-or-that undertone. (And I've left mailing lists because of it.) If Windows were that bad, then why do desktop builders try to mimic it? Much is a matter of getting accustomed to.

Anyway, if at some point utf16 had become the favourite (on Linux) we would be in bigger problems, as it can have many zeros in strings. At least Windows could serve multiple GUI languages rather early, so we have to live with some aspects (large companies wouldn't like sudden changes and want to use programs for decades).

Fwiw: it's comparable to (mysql) database content where different assumptions about what bytes represent can give weird side effects. It's about mutual agreements. Lua(tex) is rather neutral with respect to what bytes go into a filename: if I save some data using a utf8 filename (from Lua for instance) I can perfectly well reload that file. Some applications will show proper (utf8) names; others, like 'dir' in the console, will show the bytes as e.g. Latin. Not much different from what one gets when one logs into a remote machine with a different terminal setup. Which reminds me: last week I entered a Lua interactive console on Ubuntu and magically ^3 was turned into the superscript Unicode 3 character ... so, talking of a mess-up ... to some extent I can understand such default behaviour, so I'll live with it.

It's cut 'n paste and the assumptions of other applications that (at least on Windows) can turn something utf8 into something looking weird. It's really not much different from typesetting a utf8 encoded document in a tex that expects 8-bit texnansi encoding: the typeset stream looks weird but in fact is honest utf8 visualized.

Of course we could introduce an abstract filename object (including all these attributes that relate to files) but it's not really a solution. Simply converting utf8 encoded filenames into utf16 doesn't work out well, because in between we use C strings, and these have this '60s property of being zero-terminated, so in practice one ends up with utf16 names clipped to length 1.

When on Windows one mixes applications in a workflow it is important to make sure that one doesn't get code page translations in the way. Annoying indeed, but using computers is full of annoyances. You don't want to know what troubles we sometimes have with graphics coming from Apple infrastructures to Linux infrastructures. Filenames are always a bit of an issue.

Hans
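The "clipped to length 1" effect Hans mentions is easy to demonstrate, assuming a little-endian machine (as every Windows PC is):

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* In UTF-16LE the high byte of every ASCII code unit is zero,
           so any char*-based routine stops after the first byte. */
        wchar_t wname[] = L"Song.tex";
        printf("%u\n", (unsigned) strlen((const char *) wname)); /* prints 1 */
        return 0;
    }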
On Sun, Dec 29, 2013 at 12:44:35PM +0100, Hans Hagen wrote:
On 12/29/2013 10:10 AM, Khaled Hosny wrote:
On Sun, Dec 29, 2013 at 01:07:15AM +0000, Philipp Maximilian Stephani wrote:
but honestly, we're not living in the 1960s any more.
No, we are not, but Windows is.
I always wonder why folks need comments like this.
If Windows is the only OS at odds in this specific situation, then it is LuaTeX’s fault, right? I can understand the OP’s frustration, but repeatedly blaming LuaTeX [developers] for the deficiencies of his OS of choice does not seem very productive to me, and can be annoying. Instead of telling us what LuaTeX should and shouldn’t have done, he can either 1) submit a patch, 2) switch to a not-so-1960s OS, 3) switch to a not-so-1960s application, or 4) just live with it.

Regards,
Khaled
Hans Hagen
On 12/29/2013 10:10 AM, Khaled Hosny wrote:
On Sun, Dec 29, 2013 at 01:07:15AM +0000, Philipp Maximilian Stephani wrote:
but honestly, we're not living in the 1960s any more.
No, we are not, but Windows is.
I always wonder why folks need comments like this. I can come up with Linux aspects that are 1960s too.
Actually, any 1960s aspects of Windows are likely to have the same roots as those of Linux: MSDOS 1.0 mimicked the open/read/write/fd system calls in addition to those taken from CP/M, where the end-of-string character for some system calls was '$'. Seriously. Since MSDOS gravitated towards being a C-based platform (as opposed to its roots in PL/M), it had sort of a split personality from the get-go. Windows NT tried to get rid of some of the CP/M roots, instead borrowing from VMS. So the internals were always floating a bit between different worlds, and not all of the compromises make a whole lot of sense. Linux certainly has more rather than fewer 1960s roots than Windows, but they are less tangled.
I more and more tend to ignore discussions (and mails) that have this OS-bad-this-or-that undertone. (And I've left mailing lists because of it.) If Windows were that bad, then why do desktop builders try to mimic it?
GUI and internals are quite different. And it's mostly a matter of market share: at the time when serious alternatives had not yet been driven off the market by making the office suites (where Microsoft had its actual muscle) run worse on them, systems like NeXTstep, MacOS and OS/2 (what was its desktop called? Presentation Manager?) were favored. What a user interacts with most of the time is not the desktop environment but rather the applications. And most certainly not the system calls.
Much is a matter of getting accustomed to.
Anyway, if at some point utf16 had become the favourite (on Linux) we would be in bigger problems, as it can have many zeros in strings.
Since strings are then composed of "wide characters", an 8-bit zero is not all that interesting. The whole UTF16 approach in C, namely "wide characters", was pretty much a disaster for writing programs. And yes, it's an entirely UNIX/C-born disaster. The escape route of UTF8 was, I think, designed for Plan 9 (or some other side project from the original UNIX authors). So Microsoft is certainly not responsible for creating the disaster that UTF16 turned out to be, more for embracing it and not letting go of it when the times changed. Windows still does not have a working UTF8 locale IIRC, even though most of the applications use it internally.
At least Windows could serve multiple GUI languages rather early
Yes.
so we have to live with some aspects
No.
(large companies wouldn't like sudden changes and want to use programs for decades).
That's what locales are for. And Windows did not offer a working UTF8 locale last time I looked.
When on Windows one mixes applications in a workflow it is important to make sure that one doesn't get code page translations in the way. Annoying indeed, but using computers is full of annoyances.
And there are also solutions. Applications like Emacs can work with multiple encodings well. And that means that the primitives for accessing files translate its internal utf-8 based encoding into the "filename encoding". It's perfectly fine that LuaTeX works just in utf-8 internally, but that means that one needs a translation layer (different for different operating systems) that will, at least when used from within TeX, transparently convert utf-8 to the filename encoding. This layer is pretty much needed for every operating system: most operating systems, even if utf-8 based, also tend to have a "canonical" encoding for potentially composite glyphs.

-- David Kastrup
2013/12/29, David Kastrup
That's what locales are for. And Windows did not offer a working UTF8 locale last time I looked.
Indeed, it does not. From the documentation of setlocale:

    The set of available languages, country/region codes, and code pages
    includes all those supported by the Win32 NLS API except code pages
    that require more than two bytes per character, such as UTF-7 and
    UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale
    will fail, returning NULL.
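That is easy to verify; a minimal check under the Visual C runtime described above (".65001" is MSVC's locale-string syntax for requesting code page 65001, i.e. UTF-8):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Per the documentation quoted above, requesting the UTF-8
           code page is expected to fail and return NULL. */
        char *loc = setlocale(LC_ALL, ".65001");
        printf("UTF-8 locale: %s\n", loc ? loc : "(not supported)");
        return 0;
    }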
And there are also solutions. Applications like Emacs can work with multiple encodings well. And that means that the primitives for accessing files translate its internal utf-8 based encoding into the "filename encoding". It's perfectly fine that LuaTeX works just in utf-8 internally, but that means that one needs a translation layer (different for different operating systems) that will, at least when used from within TeX, transparently convert utf-8 to the filename encoding.
This layer is pretty much needed for every operating system: most operating systems, even if utf-8 based, also tend to have a "canonical" encoding for potentially composite glyphs.
-- David Kastrup
This layer is certainly needed for a program intended to be used worldwide on as many operating systems as possible (or on as many as there are volunteers to port it to).

This will work, I think, for OSes that do not support UTF-8:

    utf8 -> wchar_t's (provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
    wchar_t's -> chars via wcstombs()

wcstombs() is affected by the locale settings. According to the C language specification,

    setlocale( LC_ALL, "" ); //Sets the locale to the native environment.

In Windows this is the user-default ANSI code page obtained from the operating system. In Windows there is also

    setlocale( LC_ALL, ".OCP" ); //Sets the locale to the current OEM code page obtained from the operating system.

I had forgotten that I had once programmed this for myself a long time ago. I just discovered it by removing an #include
On Sat Jan 04 2014 at 21:29:26, Javier Múgica de Rivera <javieraritz.ribadeo@gmail.com> wrote:
This layer is certainly needed for a program intended to be used worldwide on as many operating systems as possible (or on as many as there are volunteers to port it to).
It's already required for Windows plus any Unix system (e.g. Windows + OS X).
This will work, I think, for OS that do not support UTF-8.
The only such OS is Windows.
utf8 -> wchar_t's (provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
Don't use the dummy solution; Windows uses UTF-16.
wchar_t's -> chars via wcstombs()
No, Windows uses UTF-16. This step is unnecessary and harmful.
wcstombs() is affected by the locale settings. According to the C language specification
setlocale( LC_ALL, "" ); //Sets the locale to the native environment.
In Windows this is the user-default ANSI code page obtained from the operating system. In Windows there is also
setlocale( LC_ALL, ".OCP" ); //Sets the locale to the current OEM code page obtained from the operating system.
On Unix systems locales to some extent define the encoding being used. Windows, however, always uses UTF-16, and using locales for encoding issues does not make sense there.
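For illustration, a sketch of the conversion being discussed, on Windows where wchar_t is 16 bits: a code point becomes one UTF-16 code unit in the BMP and a surrogate pair above it, instead of the lossy "c & 0xFFFF" masking. utf16_put is a hypothetical helper, not a LuaTeX or Windows function:

    #include <wchar.h>

    static int utf16_put(wchar_t *out, unsigned int cp)
    {
        if (cp < 0x10000) {                         /* BMP: one code unit */
            out[0] = (wchar_t) cp;
            return 1;
        }
        cp -= 0x10000;                              /* supplementary planes */
        out[0] = (wchar_t) (0xD800 + (cp >> 10));   /* high surrogate */
        out[1] = (wchar_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
        return 2;
    }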
If fopen is to be used, the string should be transformed to the operating system locale. This is obvious if you are using Windows. In Linux it may not be that obvious, since all implementations always use UTF-8, or so I think.
The opposite is true: Windows never uses locale information for filenames (it always uses UTF-16 de facto), but the locale is used on Linux. Just as mentioned above in the thread, the correct solution is to never use fopen (or other filesystem functions from the C library), but always wrapper functions like g_fopen. As an experiment, I replaced all fopen calls in LuaTeX with g_fopen, but I can't figure out how to link against GLib yet :-(
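For reference, a minimal sketch of what that experiment amounts to, assuming GLib is installed (the file name is the one from the start of the thread, written out as UTF-8 bytes):

    #include <glib/gstdio.h>
    #include <stdio.h>

    int main(void)
    {
        /* g_fopen takes a UTF-8 file name on every platform: on Windows
           it converts to UTF-16 and calls _wfopen internally, elsewhere
           it is a thin wrapper around fopen. */
        FILE *f = g_fopen("Canci\xc3\xb3n.tex", "r"); /* "Canción.tex" */
        if (f == NULL) {
            perror("g_fopen");
            return 1;
        }
        fclose(f);
        return 0;
    }

Linking is usually just a matter of passing the flags reported by pkg-config --cflags --libs glib-2.0.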
utf8 -> wchar_t's (provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
Don't use the dummy solution; Windows uses UTF-16.
wchar_t's -> chars via wcstombs()
No, Windows uses UTF-16. This step is unnecessary and harmful.
You know a good deal more than I do about Windows internals. I thought that within wchar_t strings on Windows each character represented itself.

wcstombs() is affected by the locale settings. According to the C [...]
The opposite is true: Windows never uses locale information for filenames (it always uses UTF-16 de facto), but the locale is used on Linux.
I supposed it was used BOTH on Windows and Linux, but that on Linux it was never necessary due to its using UTF-8 naturally. I had noticed that the C run-time docs from Visual Studio do not mention anything about encodings for fopen or other filename-related functions, but I supposed the char* filename was interpreted according to the locale settings. But... given that the string has to be converted to UTF-16, it must be interpreted somehow, mustn't it? (You need not forward the reply to the list, since this question relates more to Windows internals than to luatex; or you may not answer at all.)

Regards,

-- Javier M.
Javier Múgica de Rivera
utf8 -> wchar_t's (provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
Don't use the dummy solution; Windows uses UTF-16.
wchar_t's -> chars via wcstombs()
No, Windows uses UTF-16. This step is unnecessary and harmful.
You know a good deal more than I do about Windows internals. I thought that within wchar_t strings on Windows each character represented itself.

wcstombs() is affected by the locale settings. According to the C [...]
The opposite is true: Windows never uses locale information for filenames (it always uses UTF-16 de facto), but the locale is used on Linux.
I supposed it was used BOTH on Windows and Linux, but that on Linux it was never necessary due to its using UTF-8 naturally.
Linux does not use "UTF-8" naturally, as far as I can tell. It uses null-terminated byte sequences. It's the job of the application to encode those byte sequences in a manner where files will not get lost. There might be some "external" file systems (like CD file systems or vfat) with a translation layer for file names that assumes the byte sequence is UTF-8, as opposed to the sequence used on the disk. But the "native" file systems will likely be transparent, and file systems written on a basically latin-1 system will show strange characters in file names when used on a basically utf-8 system.

-- David Kastrup
On Tue Jan 07 2014 at 11:02:51, Javier Múgica de Rivera <javieraritz.ribadeo@gmail.com> wrote:
utf8 -> wchar_t's (provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
Don't use the dummy solution; Windows uses UTF-16.
wchar_t's -> chars via wcstombs()
No, Windows uses UTF-16. This step is unnecessary and harmful.
You know a good deal more than I do about Windows internals. I thought that within wchar_t strings on Windows each character represented itself.

wcstombs() is affected by the locale settings. According to the C [...]
Yes, that's true, but one should be careful about locale-based conversions of filenames:

1. On Unix systems file names are byte sequences. If you represent file names as byte sequences in your application (LuaTeX does this), then you should not try to interpret the file names in any way, because doing so might prevent applications from being able to access certain files. OTOH, if your application uses Unicode strings (e.g. Java, Python 3), you have to do some encoding, and that is not trivial (e.g. Python goes through some hoops to make sure that file names that cannot be represented in Unicode, e.g. invalid UTF-8 strings, are handled correctly). Interpretation of file names is generally (but not necessarily) locale-dependent; the locale on modern installations happens to default to UTF-8, but nothing stops applications from creating files with names that are not valid UTF-8 strings. The GLib functions fulfill these requirements and are interface-compatible with the standard C functions; that's why I recommended them.

2. On Windows file names are sequences of 16-bit values. For all intents and purposes these are interpreted as UTF-16 strings, but in fact they can be invalid (unpaired surrogates) as well. If your application uses byte strings to store file names, UTF-8 should be used for them so that files with arbitrary but valid Unicode names remain accessible. If your application uses Unicode strings, they have to be converted to UTF-16. Invalid UTF-16 file names can be handled if your application stores code point sequences instead of scalar value sequences. (Many libraries don't check this.)
The opposite is true: Windows never uses locale information for filenames (it always uses UTF-16 de facto), but the locale is used on Linux.
I supposed it was used BOTH on Windows and Linux, but that on Linux it was never necessary due to its using UTF-8 naturally. I had noticed that the C run-time docs from Visual Studio do not mention anything about encodings for fopen or other filename-related functions, but I supposed the char* filename was interpreted according to the locale settings.
This is true. Note the difference: it's the OS interface, not the OS kernel, that does this interpretation. This interpretation is included for compatibility with Windows 9x, but is broken because it doesn't provide access to arbitrary Unicode strings. It should be avoided.
But... given that the string has to be converted to UTF-16, it must be interpreted somehow, mustn't it?
Interpreted by whom? AFAIK neither the Windows nor the Linux kernel interprets file names in any encoding-related way. (OS X might be more complex because of its Mach/XNU/BSD/Carbon/Cocoa layering.) The difference is that they use code units of different widths (8 bits vs. 16 bits). Applications do have to interpret these strings in some meaningful way, and in order to do so, they need to decide whether to use byte strings or Unicode strings as their data model.
participants (6)

- David Kastrup
- Hans Hagen
- Javier Múgica de Rivera
- Khaled Hosny
- Philipp Maximilian Stephani
- Philipp Stephani