[Dev-luatex] Filename encoding

Hans Hagen pragma at wxs.nl
Sun Dec 29 12:44:35 CET 2013

On 12/29/2013 10:10 AM, Khaled Hosny wrote:
> On Sun, Dec 29, 2013 at 01:07:15AM +0000, Philipp Maximilian Stephani wrote:
>> but honestly, we're not living in the 1960s any more.
> No, we are not, but Windows is.

I always wonder why folks need comments like this. I can come up with
linux aspects that are 1960s too. I more and more tend to ignore
discussions (and mails) that have this OS bad-this-or-that undertone.
(And I've left mailing lists because of it.) If windows was that bad,
then why do desktop builders try to mimic it? Much is a matter of what
one is accustomed to.

Anyway, if at some point utf16 had become the favourite (on linux) we
would be in bigger trouble, as utf16 strings can contain many zero
bytes. At least windows could serve multiple gui languages rather early,
so we have to live with some aspects (large companies wouldn't like
sudden changes and want to use programs for decades). Fwiw: it's
comparable to (mysql) database content, where different assumptions
about what bytes represent can give weird side effects. It's about
mutual agreements.

Lua(tex) is rather neutral with respect to what bytes go into a
filename: if I save some data using a utf8 filename (from lua for
instance) I can perfectly well reload that file. Some applications will
show proper (utf8) names, others, like 'dir' in the console, will show
the bytes as e.g. latin. Not much different from what one gets when one
logs into a remote machine with a different terminal setup.
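
A minimal sketch of that effect, in python rather than lua and with a
made-up filename: the bytes of the name never change, only the
interpretation by whatever displays them. The choice of latin-1 as the
"wrong" encoding is an assumption for illustration; a real console may
use a different code page.

```python
# Hypothetical filename, chosen only for illustration ("cafe" with e-acute).
name = "caf\u00e9.tex"

# The bytes a utf8-aware program hands to the file system.
raw = name.encode("utf-8")

# What a viewer shows when it interprets those same bytes as latin-1:
# the two-byte utf8 sequence for e-acute becomes two separate characters.
shown_as_latin = raw.decode("latin-1")

# Reading the bytes as utf8 again recovers the original name, so nothing
# was lost: only the display differs.
recovered = raw.decode("utf-8")

print(raw)                # b'caf\xc3\xa9.tex'
print(len(name), len(shown_as_latin))  # 8 9
print(recovered == name)  # True
```

The name looks one character longer in the latin view, which is the
kind of weirdness one sees in a listing, while the file itself can
still be opened with the original bytes.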

Which reminds me: last week I entered a lua interactive console on
ubuntu and magically ^3 was turned into the unicode superscript 3
character ... so, talking of a mess up ... to some extent I can
understand such default behaviour so I'll live with it.

It's cut 'n paste and assumptions of other applications that (at least 
on windows) can turn something utf8 into something looking weird. It's 
really not much different from typesetting a utf8 encoded document in a
tex that expects 8 bit texnansi encoding. The typeset stream looks weird 
but in fact is honest utf8 visualized.

Of course we could introduce an abstract filename object (including all
the attributes that relate to a file) but it's not really a solution.
Simply converting utf8 encoded filenames into utf16 doesn't work out
well because in between we use C-strings, and these have this 1960s
property of being zero terminated, so in practice one ends up with
utf16 names clipped to length 1.
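
To make the clipping concrete, here is a small sketch in python (the
filename and the helper as_c_string are made up for illustration; this
is not how luatex handles filenames). It uses ctypes to read a byte
sequence the way zero-terminated C code would, assuming little-endian
utf16 without a byte order mark.

```python
import ctypes

# Hypothetical filename, chosen only for illustration.
name = "a.tex"

utf8 = name.encode("utf-8")       # b'a.tex', no zero bytes
utf16 = name.encode("utf-16-le")  # b'a\x00.\x00t\x00e\x00x\x00'

def as_c_string(data: bytes) -> bytes:
    """Return what C code sees when it reads data as a char* string."""
    buf = ctypes.create_string_buffer(data)  # copies data, appends a zero
    return buf.value                         # stops at the first zero byte

print(as_c_string(utf8))   # b'a.tex'
print(as_c_string(utf16))  # b'a'
```

Every ascii character in utf16 carries a zero byte, so the C-string
view of the utf16 name ends right after the first character, while the
utf8 name passes through untouched.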

When on windows one mixes applications in a workflow it is important to 
make sure that one doesn't get code page translations in the way. 
Annoying indeed, but using computers is full of annoyances. You don't
want to know what troubles we sometimes have with graphics coming from 
apple infrastructures to linux infrastructure. Filenames are
always a bit of an issue.


                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
