[Dev-luatex] Filename encoding

David Kastrup dak at gnu.org
Sun Dec 29 13:24:05 CET 2013

Hans Hagen <pragma at wxs.nl> writes:

> On 12/29/2013 10:10 AM, Khaled Hosny wrote:
>> On Sun, Dec 29, 2013 at 01:07:15AM +0000, Philipp Maximilian Stephani wrote:
>>> but honestly, we're not living in the 1960s any more.
>> No, we are not, but Windows is.
> I always wonder why folks need comments like this. I can come up with
> linux aspects that are 1960.

Actually, any 1960 aspects of Windows are likely to have the same roots
as those of Linux: MSDOS1.0 mimicked the open/read/write/fd system calls
in addition to those taken from CP/M where the end of string character
for some system calls was '$'.  Seriously.

Since MSDOS gravitated towards being a C based platform (as opposed to
its roots in PL/M), it had sort of a split personality from the get-go.
Windows NT tried to get rid of some of the CP/M roots, instead borrowing
from VMS.  So the internals were always floating a bit between different
worlds, and not all of the compromises make a whole lot of sense.

Linux certainly has more rather than less 1960 roots than Windows, but
they are less tangled.

> I more and more tend to ignore discussions (and mails) that have this
> OS bad-this-or-that undertone. (And I've left mailings because of it.)
> If windows was that bad, then why do desktop builders try to mimic
> it.

GUI and internals are quite different.  And it's mostly a matter of
market share: back when serious alternatives had not yet been driven off
the market by making the office suites (where Microsoft had its actual
muscle) run worse on them, systems like NeXTSTEP, MacOS and OS/2 (what
was its desktop called? Presentation Manager?) were favored.

What a user is interacting with most of the time is not the desktop
environment but rather the applications.  And most certainly not the
system calls.

> Much is a matter of getting accustomed to.
> Anyway, if at some point utf16 had become the favourite (on linux) we
> would be in bigger problems as it can have many zero's in strings.

Since strings are then composed of "wide characters", an 8-bit zero is
not all that interesting.  The whole UTF16 approach in C, namely "wide
characters", was pretty much a disaster regarding writing programs.  And
yes, it's an entirely UNIX/C-born disaster.
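A quick way to see the zero-byte point, as a minimal Python sketch (the
sample string is an arbitrary assumption, picked only for illustration):

```python
# Arbitrary sample text (an assumption): ASCII plus one non-ASCII letter.
text = "caf\u00e9.tex"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # 16-bit code units, little-endian, no BOM

# UTF-8 only produces a zero byte for U+0000 itself, so ordinary text
# survives byte-oriented, NUL-terminated C string handling.
utf8_has_nul = b"\x00" in utf8

# In UTF-16 every ASCII character carries a zero high byte ...
utf16_has_nul = b"\x00" in utf16

# ... but seen as 16-bit "wide characters" there is no zero unit in the
# text, which is why a single 8-bit zero is not the interesting terminator.
units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
has_zero_unit = b"\x00\x00" in units

print(utf8_has_nul, utf16_has_nul, has_zero_unit)
```

The cost is that every byte-oriented interface (char* strings, the
str*() functions) needs a wide-character twin, which is where the pain
of writing programs comes from.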

The escape route of UTF8 was, I think, rather designed in Inferno(?) or
some other side project from original UNIX authors.

So Microsoft is certainly not responsible for creating the disaster that
UTF16 turned out to be.  More for embracing it and not letting go of it
when the times changed.  Windows still does not have a working UTF8
locale IIRC, even though most of the applications use it internally.

> At least windows could serve multiple gui languages rather early
> so we have to live with some aspects
> (large companies wouldn't like sudden changes and want to use programs
> decades).

That's what locales are for.  And Windows did not offer a working UTF8
locale last time I looked.

> When on windows one mixes applications in a workflow it is important
> to make sure that one doesn't get code page translations in the
> way. Annoying indeed, but using computers is full of annoyances.

And there are also solutions.  Applications like Emacs can work with
multiple encodings well.  And that means that the primitives for
accessing files translate its internal utf-8 based encoding into the
"filename encoding".  It's perfectly fine that LuaTeX works just in
utf-8 internally, but that means that one needs a translation layer
(different for different operating systems) that will, at least when
used from within TeX, transparently convert utf-8 to the filename
encoding.

This layer is pretty much needed for every operating system: most
operating systems, even if utf-8 based, also tend to have a "canonical"
encoding for potentially composite glyphs.
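Such a layer could look roughly like the following Python sketch.  The
helper name to_filename_bytes and its parameters are hypothetical, this
is not how LuaTeX or Emacs actually do it, and which normal form a given
system expects is an assumption the caller has to supply:

```python
import sys
import unicodedata


def to_filename_bytes(name_utf8, encoding=None, normal_form="NFC"):
    """Hypothetical helper: map an internal UTF-8 name to filename bytes.

    encoding    -- target filename encoding; defaults to what Python
                   reports for the current system
    normal_form -- canonical form for composite characters (assumed to
                   be chosen per operating system by the caller)
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    text = name_utf8.decode("utf-8")
    # Settle on one canonical spelling of composite characters first.
    text = unicodedata.normalize(normal_form, text)
    # Then convert to the filename encoding; raises UnicodeEncodeError
    # if the name cannot be represented there.
    return text.encode(encoding)


# The same visible name, once with a combining accent and once precomposed.
decomposed = "cafe\u0301.tex".encode("utf-8")
composed = "caf\u00e9.tex".encode("utf-8")

same = (to_filename_bytes(decomposed, "utf-8")
        == to_filename_bytes(composed, "utf-8"))
latin1 = to_filename_bytes(decomposed, "latin-1")
```

Normalizing before encoding is what lets both spellings of the name end
up at the same file, regardless of how the TeX source happened to be
typed.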

David Kastrup
