On Tue Jan 07 2014 at 11:02:51, Javier Múgica de Rivera <javieraritz.ribadeo@gmail.com> wrote:
utf8 -> wchar_t's (Provide some dummy solution for values > 2^16, e.g. c & 0xFFFF)
Don't use the dummy solution; Windows uses UTF-16.
wchar_t's -> chars via wcstombs()
No, Windows uses UTF-16. This step is unnecessary and harmful.
You know a good deal more than me about Windows' internals. I thought that within wchar_t strings on Windows each character represented itself. wcstombs() is affected by the locale settings. According to the C
Yes, that's true, but you should be careful with locale-based conversions of file names:

1. On Unix systems file names are byte sequences. If you represent file names as byte sequences in your application (LuaTeX does this), then you should not try to interpret the file names in any way, because doing so might prevent the application from being able to access certain files. OTOH, if your application uses Unicode strings (e.g. Java, Python 3), you have to do some encoding, and that is not trivial (e.g. Python goes through some hoops to make sure that file names that cannot be represented in Unicode, e.g. invalid UTF-8 strings, are handled correctly). Interpretation of file names is generally (but not necessarily) locale-dependent; the locale on modern installations happens to default to UTF-8, but nothing stops applications from creating files with names that are not valid UTF-8 strings. The GLib functions fulfill these requirements and are interface-compatible with the standard C functions; that's why I recommended them.

2. On Windows file names are sequences of 16-bit values. For all intents and purposes these are interpreted as UTF-16 strings, but in fact they can be invalid UTF-16 (unpaired surrogates) as well. If your application stores file names as byte strings, it should use UTF-8 for them in order to be able to access files with arbitrary valid Unicode names. If your application uses Unicode strings, they have to be converted to UTF-16. Invalid UTF-16 file names can be handled if your application stores code point sequences instead of scalar value sequences. (Many libraries don't check this.)
The opposite is true: Windows never uses locale information for filenames (it always uses UTF-16 de facto), but the locale is used on Linux.
I supposed it was used BOTH on Windows and Linux, but that on Linux it was never necessary because Linux uses UTF-8 naturally. I had noticed that the C run-time docs from Visual Studio do not mention anything about encodings for fopen or other filename-related functions, but I supposed the char* filename was interpreted according to the locale settings.
This is true. Note the difference: it's the OS interface, not the OS kernel, that does this interpretation. This interpretation is included for compatibility with Windows 9x, but is broken because it doesn't provide access to arbitrary Unicode strings. It should be avoided.
But... given that the string has to be converted to UTF-16, it must be interpreted somehow, mustn't it?
Interpreted by whom? AFAIK neither the Windows nor the Linux kernels interpret file names in any encoding-related way. (OS X might be more complex because of its Mach/XNU/BSD/Carbon/Cocoa layering.) The difference is that they use code units of different widths (8 bits vs. 16 bits). Applications do have to interpret these strings in some meaningful way, and in order to do so, they need to decide whether to use byte strings or Unicode strings as data model.