Digraphs ij, nj, lj, ch, ... (was: new upload)

27 Aug 2007

      On 8/26/07, Hans Hagen wrote:
...
anyhow, ... new upload to play with
Thanks! "Lj"igatures work much better now ;), but \word lost its
functionality, it seems.
...
...
- LM doesn't have any lj, nj, dz, dž, ... (probably another request
for the Polish guys)
hm, just write a small proposal ...
I already did.
...
however, dealing with non present chars is to be dealt with anyway
...
- It would be great if MK IV did the transformation from digraphs to
normal letters in case those digraphs are not present in the font
itself (for ij, lj, nj, dz, dž, ... just as it would be great if
ccaron was automatically composed out of c and caron if the letter
wasn't present in that font).
\definefontfeature
[test][mode=node,language=dflt,script=latn,complement=yes]
{\font\test = lmtypewriter8-regular*test at 12.3pt \test ǉubǉana
ǈubǉana  ǇUBǇANA }
currently the complement only replaces LATIN/compat combinations (see
char-def.lua)
(encoding in email screwed up a bit, but no problem)

Great! This works perfectly!

It works as expected for fonts with no such glyphs (and could/should
be added to default font features in my opinion). When tested with
fonts including original glyphs (I was testing with IJ), the original
glyph was used, so that was perfectly OK.
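
The fallback described above can be sketched outside TeX. The Python
snippet below is only an illustration, not the MkIV code: the function
name complement and the set `available` (standing in for a font's glyph
coverage) are assumptions made for this example. It replaces a character
that the font lacks by its one-level Unicode decomposition and leaves
everything else untouched.

```python
import unicodedata


def complement(text, available):
    """Replace characters missing from `available` by their decomposition.

    `available` is a hypothetical stand-in for the set of characters
    that a font can render directly.
    """
    out = []
    for ch in text:
        if ch in available:
            # The font has the glyph: use it as it is.
            out.append(ch)
            continue
        fields = unicodedata.decomposition(ch).split()
        # Drop a leading tag such as "<compat>"; the rest are hex code points.
        parts = [chr(int(f, 16)) for f in fields if not f.startswith("<")]
        # No decomposition known: keep the character unchanged.
        out.append("".join(parts) if parts else ch)
    return "".join(out)
```

With a coverage set that lacks U+01C9, complement("\u01c9ub\u01c9ana",
set("abjlnu")) gives "ljubljana"; once U+01C9 is in the set, the input
comes back unchanged, which matches the behaviour reported for fonts
that do include the original glyph.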
...
...
Visually there is probably no difference in plain text, except in
exactly the cases for which you're sending the tests (that's casing
and spacing). See http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet
how the word "MJENJACNICA" is split into letters.
Normal people still type n+j in text, not the digraph "ǌ" (nj), but in
case you get some text with those digraphs which are valid Unicode
letters, it would be nice if they were processed ...
dealing with n+j in text is too dangerous to catch, unless we start
implementing complex language dependent replacements, and even then it's
messy (what to do when one really wants a nj (two char)) ... so, those
old docs can best be converted to proper utf then
Hmmm ... pdfTeX with LM already replaces all occurrences of ij with an
ij ligature (noticed when I took a look at the strange kerning between
the two letters - not present with CM). :-Z

Consider the desired output of
    \Word{ijsselmeer} % example taken from wikipedia, I don't speak
Dutch yet ;-)
when writing in Dutch. (\mainlanguage[nl], not when writing in English)
One might like to treat every ij as a single letter and then convert
both I and J to uppercase when asked for that.
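
As a rough sketch of that idea (in Python rather than TeX, and not
anything ConTeXt provides): the function name capitalize_word and its
`language` parameter are made up for this illustration, and treating
only a word-initial "ij" specially is an assumption.

```python
def capitalize_word(word, language="en"):
    """Uppercase the first letter of `word` and lowercase the rest.

    For Dutch ("nl"), a leading "ij" is treated as a single letter,
    so both characters are uppercased together.
    """
    lowered = word.lower()
    if language == "nl" and lowered.startswith("ij"):
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]
```

So capitalize_word("ijsselmeer", "nl") gives "IJsselmeer", while the
same call without the language argument gives "Ijsselmeer".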

I read that Dutch keyboard includes the ij digraph (ĳ), but the
Croatian/Serbian keyboards don't include those digraphs and none of
the cp1250 and iso-8859-2 encodings have it, so everyone writes with
"plain latin letters" - I doubt that anyone uses digraphs at all. (It
would be almost as obscure and inconvenient to write them as if
someone tried to write with "fi" unicode ligatures.)

Yet the third group of people are the Czech/Slovaks with their digraph
"ch". (You probably remember that one since you had to implement
sorting rules for them for every variant and version of the sorting
mechanism you have ever written.) Unicode doesn't even have place for
it (http://unicode.org/faq/ligature_digraph.html).

In Croatian, nj is always considered to be a single letter. (Even in
foreign words, you would see "Isak Njutn" [Nj-u-t-n] or "Ajnštajn", so
basically no worries about exceptions. Even if there would be some,
one could always say {\language[en] a foreign word with nj})
...
...
Another few observations:
- \word doesn't work in XeTeX
no, neither in pdftex i think; new
Currently it doesn't even work in luatex any more :-(
...
...
- What exactly is \Words supposed to do (with non-first letters in a
word)?
make first chars uppercase but only when the next is a char; (i changed
it a bit, defs were not seen (overloaded later by macros)
...
An extra challenge would be to get this work (but unless some Croats
ask you for that or unless you have too much time left, don't bother
about that - it needs slightly more than only lccode and uccode of a
letter since there are three forms: one for lowercase [ljubljana ->
lj], one for all-uppercase words [LJUBLJANA -> LJ] and one for the
first letter of a word starting with an uppercase [Ljubljana -> Lj]):
In Unicode:
\word{ǉubǉana} -> ǉubǉana
\Word{ǉubǉana} -> ǈubǉana
\WORD{ǉubǉana} -> ǇUBǇANA
\word{ǈubǉana} -> ǉubǉana
\Word{ǈubǉana} -> ǈubǉana
\WORD{ǈubǉana} -> ǇUBǇANA
\word{ǇUBǇANA} -> ǉubǉana
\Word{ǇUBǇANA} -> ǈubǉana
\WORD{ǇUBǇANA} -> ǇUBǇANA
as long as we have utf it's already taken care of
It's not.

On the contrary (written in Latin without ligatures):
\WORD{Lj} -> LJ (let's say it's OK)
\WORD{LJ} -> Lj (wrong)

\word lost its functionality, so I cannot check.

The main problem is that:
- ligature lj is always lowercase
- ligature Lj is uppercase, but only at the beginning of a word where
other letters are lowercase (Ljubljana)
- ligature LJ is uppercase, but only at the beginning of a word where
other letters are uppercase (LJUBLJANA)

This works as long as L and J are two separate letters, but fails in
the example where we have ligatures/digraphs (even the basic
functionality is currently broken).

See http://unicode.org/charts/case/index.html (search for 01C7 on that
page - while most letters only have lowercase and uppercase, those few
also have some kind of "middle" case.)
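
To make the three case forms concrete, here is a small Python sketch
(only an illustration of the Unicode mappings, not of how \word, \Word
and \WORD are or should be implemented; the function names are made up
for this example). Python's str.lower(), str.title() and str.upper()
apply the Unicode lowercase, titlecase and uppercase mappings, so
U+01C9 (lj), U+01C8 (Lj) and U+01C7 (LJ) map onto each other as needed.

```python
def word_lower(text):
    """All lowercase: every digraph becomes its lowercase form (U+01C9)."""
    return text.lower()


def word_title(text):
    """First character in titlecase (U+01C8 for lj), the rest lowercase."""
    return text[:1].title() + text[1:].lower()


def word_upper(text):
    """All uppercase: every digraph becomes its uppercase form (U+01C7)."""
    return text.upper()
```

For the input "\u01c9ub\u01c9ana" the three functions return
"\u01c9ub\u01c9ana", "\u01c8ub\u01c9ana" and "\u01c7UB\u01c7ANA"
respectively, and the same results come out when the input starts from
the titlecase or the uppercase spelling.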

But again: I have no idea how many people use \Word and \WORD to
capitalize words (I used it in some cases where I modified the macro
to get really fancy beginning of words). Most would probably solve the
problem "manually" anyway. (So don't bother about implementing until
someone really requests it.)
...
...
In Latin transcript (in case you have problems seeing some Unicode
letters):
\word{ljubljana} -> ljubljana
\Word{ljubljana} -> Ljubljana
\WORD{ljubljana} -> LJUBLJANA
\word{Ljubljana} -> ljubljana
\Word{Ljubljana} -> Ljubljana
\WORD{Ljubljana} -> LJUBLJANA
\word{LJUBLJANA} -> ljubljana
\Word{LJUBLJANA} -> Ljubljana
\WORD{LJUBLJANA} -> LJUBLJANA
...
...
...
{\setcharacterkerning[extrakerning]\input zapf\endgraf }
hm, i'm not going to backport everything; keep in mind that these
features are not font related; actually future mkiv versions will also
do dynamic feature change so ...
(This has recently been added to XeTeX as well. So it doesn't
necessarily mean "backport the lua functionality", but more "map this
keyword to that XeTeX feature". But I would need to check. Don't worry
about it. If anyone asks, then maybe ...)

Thanks a lot,
    Mojca

Mojca Miklavec

Hans Hagen
