Digraphs ij, nj, lj, ch, ... (was: new upload)
On 8/26/07, Hans Hagen wrote:
anyhow, ... new upload to play with
Thanks! "Lj"igatures work much better now ;), but \word lost its functionality, it seems.
- LM doesn't have any lj, nj, dz, dž, ... (probably another request for the Polish guys)
hm, just write a small proposal ...
I already did.
however, non present chars have to be dealt with anyway
- It would be great if MK IV did the transformation from digraphs to normal letters when those digraphs are not present in the font itself (for ij, lj, nj, dz, dž, ...), just as it would be great if ccaron was automatically composed out of c and caron if the letter wasn't present in that font.
\definefontfeature [test][mode=node,language=dflt,script=latn,complement=yes]
{\font\test = lmtypewriter8-regular*test at 12.3pt \test ljubljana Ljubljana LJUBLJANA }
currently the complement only replaces LATIN/compat combinations (see char-def.lua)
(The encoding in the email got screwed up a bit, but no problem.) Great! This works perfectly! It works as expected for fonts with no such glyphs (and could/should be added to the default font features, in my opinion). When tested with fonts that include the original glyphs (I was testing with IJ), the original glyph was used, so that was perfectly OK.
Visually there is probably no difference in plain text, except in exactly the cases for which you're sending the tests (that's casing and spacing). See http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet for how the word "MJENJACNICA" is split into letters. Normal people still type n+j in text, not the digraph "ǌ" (nj), but in case you get some text with those digraphs, which are valid Unicode letters, it would be nice if they were processed ...
dealing with n+j in text is too dangerous to catch, unless we start implementing complex language dependent replacements, and even then it's messy (what to do when one really wants a nj (two char)) ... so, those old docs can best be converted to proper utf then
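A rough illustration of the fallback behaviour described above (keep the digraph when the font has it, otherwise fall back to plain letters). This is only a sketch in Python using the standard `unicodedata` module, not the MkIV code and not what char-def.lua actually contains; the function name `complement` and the `available` set standing in for a font's glyph coverage are assumptions made for the example.

```python
import unicodedata

def complement(text, available):
    """Replace characters missing from `available` (a hypothetical set of
    characters the font covers) by their Unicode compatibility mapping."""
    out = []
    for ch in text:
        if ch in available:
            # The font has the glyph: keep the original character.
            out.append(ch)
        else:
            # NFKC splits compatibility digraphs (e.g. U+01C9 -> l + j) while
            # keeping accented letters such as U+017E precomposed. Characters
            # without a compatibility mapping come back unchanged.
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)

# U+01C9 is the lowercase lj digraph; the "font" here only covers u, b, a, n.
print(complement("\u01c9ub\u01c9ana", set("uban")))
```

Like the complement feature as described, this only covers the compatibility combinations; composing ccaron out of c and caron would need a separate step.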
Hmmm ... pdfTeX with LM already replaces all occurrences of ij with an ij ligature (noticed when I took a look at the strange kerning between the two letters - not present with CM). :-Z

Consider the desired output of
\Word{ijsselmeer} % example taken from wikipedia, I don't speak Dutch yet ;-)
when writing in Dutch (\mainlanguage[nl], not when writing in English). One might like to treat every ij as a single letter and then convert both I and J to uppercase when asked for that.

I read that the Dutch keyboard includes the ij digraph (ij), but the Croatian/Serbian keyboards don't include those digraphs and none of the cp1250 and iso-8859-2 encodings have them, so everyone writes with "plain latin letters" - I doubt that anyone uses digraphs at all. (It would be almost as obscure and inconvenient to write them as if someone tried to write with "fi" unicode ligatures.)

Yet a third group of people are the Czechs/Slovaks with their digraph "ch". (You probably remember that one, since you had to implement sorting rules for them for every variant and version of the sorting mechanism you have ever written.) Unicode doesn't even have a place for it (http://unicode.org/faq/ligature_digraph.html).

In Croatian, nj is always considered to be a single letter. (Even in foreign words you would see "Isak Njutn" [Nj-u-t-n] or "Ajnštajn", so basically no worries about exceptions. Even if there were some, one could always say {\language[en] a foreign word with nj}.)
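To make the Dutch \Word{ijsselmeer} case concrete, here is a minimal Python sketch of language-dependent capitalization. The function name `capitalize` and the `lang` parameter are hypothetical and only loosely mirror \mainlanguage; this is not how ConTeXt implements \Word. It treats a leading i+j as one letter for Dutch only.

```python
def capitalize(word, lang="en"):
    """Uppercase the first letter of `word`; for Dutch (lang == "nl"),
    a leading "ij" is treated as a single letter and becomes "IJ"."""
    lowered = word.lower()
    if lang == "nl" and lowered.startswith("ij"):
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]

print(capitalize("ijsselmeer", lang="nl"))  # Dutch rule applied
print(capitalize("ijsselmeer"))             # default: only the i is raised
```

The sketch only looks at the start of the word and blindly assumes that a leading i+j is the digraph, which is exactly the kind of guess that becomes dangerous once it is applied to arbitrary letter pairs in running text.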
Another few observations: - \word doesn't work in XeTeX
no, neither in pdftex i think; new
Currently it doesn't even work in luatex any more :-(
- What exactly is \Words supposed to do (with non-first letters in a word)?
make first chars uppercase but only when the next is a char; (i changed it a bit, defs were not seen (overloaded later by macros))
An extra challenge would be to get this to work (but unless some Croats ask you for that, or unless you have too much time left, don't bother about that - it needs slightly more than only the lccode and uccode of a letter, since there are three forms: one for lowercase [ljubljana -> lj], one for all-uppercase words [LJUBLJANA -> LJ] and one for the first letter of a word starting with an uppercase [Ljubljana -> Lj]):
In Unicode:
\word{ǉubǉana} -> ǉubǉana \Word{ǉubǉana} -> ǈubǉana \WORD{ǉubǉana} -> ǇUBǇANA
\word{ǈubǉana} -> ǉubǉana \Word{ǈubǉana} -> ǈubǉana \WORD{ǈubǉana} -> ǇUBǇANA
\word{ǇUBǇANA} -> ǉubǉana \Word{ǇUBǇANA} -> ǈubǉana \WORD{ǇUBǇANA} -> ǇUBǇANA
as long as we have utf it's already taken care of
It's not. On the contrary (written in latin without ligatures):
\WORD{Lj} -> LJ (let's say it's OK)
\WORD{LJ} -> Lj (wrong)
\word lost its functionality, so I cannot check.

The main problem is that:
- ligature lj is always lowercase
- ligature Lj is uppercase, but only at the beginning of a word where other letters are lowercase (Ljubljana)
- ligature LJ is uppercase, but only at the beginning of a word where other letters are uppercase (LJUBLJANA)

This works as long as L and J are two separate letters, but fails in the example where we have ligatures/digraphs (even the basic functionality is currently broken). See http://unicode.org/charts/case/index.html (search for 01C7 on that page - while most letters only have lowercase and uppercase, those few also have some kind of "middle" case).

But again: I have no idea how many people use \Word and \WORD to capitalize words (I used it in some cases where I modified the macro to get a really fancy beginning of words). Most would probably solve the problem "manually" anyway. (So don't bother about implementing it until someone really requests it.)
In Latin transcript (in case you have problems seeing some Unicode letters):
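The three forms (lowercase, "middle"/titlecase, uppercase) can be seen directly in the Unicode case mappings. The following is a small Python sketch, not the ConTeXt implementation; the helper names `lower_word`, `title_word` and `upper_word` are made up here to mirror \word, \Word and \WORD. It relies on Python's `str.title()` applying the Unicode titlecase mapping to the first letter.

```python
LJ_LOWER = "\u01c9"  # lj digraph, lowercase
LJ_TITLE = "\u01c8"  # Lj digraph, titlecase (the "middle" case)
LJ_UPPER = "\u01c7"  # LJ digraph, uppercase

def lower_word(s):
    # Analogue of \word: everything lowercase.
    return s.lower()

def title_word(s):
    # Analogue of \Word: titlecase the first character, lowercase the rest.
    return s[:1].title() + s[1:].lower()

def upper_word(s):
    # Analogue of \WORD: everything uppercase.
    return s.upper()

name = LJ_LOWER + "ub" + LJ_LOWER + "ana"
for convert in (lower_word, title_word, upper_word):
    print(convert.__name__, convert(name))
```

With two separate letters the same helpers give ljubljana, Ljubljana and LJUBLJANA, so the digraph case only differs in needing the third (titlecase) mapping for the first letter.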
\word{ljubljana} -> ljubljana \Word{ljubljana} -> Ljubljana \WORD{ljubljana} -> LJUBLJANA
\word{Ljubljana} -> ljubljana \Word{Ljubljana} -> Ljubljana \WORD{Ljubljana} -> LJUBLJANA
\word{LJUBLJANA} -> ljubljana \Word{LJUBLJANA} -> Ljubljana \WORD{LJUBLJANA} -> LJUBLJANA
{\setcharacterkerning[extrakerning]\input zapf\endgraf }
hm, i'm not going to backport everything; keep in mind that these features are not font related; actually future mkiv versions will also do dynamic feature change so ...
(This has recently been added to XeTeX as well. So it doesn't necessarily mean "backport the lua functionality", but more "map this keyword to that XeTeX feature". But I would need to check. Don't worry about it. If anyone asks, then maybe ...)

Thanks a lot,
Mojca
Mojca Miklavec wrote:
Hmmm ... pdfTeX with LM already replaces all occurrences of ij with an ij ligature (noticed when I took a look at the strange kerning between the two letters - not present with CM). :-Z
that uses lc codes, mkiv uses node list parsing; this is also robust for complex commands inside the conversion ... and mkiv works cross page -)
Consider the desired output of \Word{ijsselmeer} % example taken from wikipedia, I don't speak Dutch yet ;-) when writing in Dutch. (\mainlanguage[nl], not when writing in English) One might like to treat every ij as a single letter and then convert both I and J to uppercase when asked for that.
yes, but not every i+j may be ij
I read that Dutch keyboard includes the ij digraph (ij), but the Croatian/Serbian keyboards don't include those digraphs and none of the cp1250 and iso-8859-2 encodings have it, so everyone writes with "plain latin letters" - I doubt that anyone uses digraphs at all. (It would be almost as obscure and inconvenient to write them as if someone tried to write with "fi" unicode ligatures.)
The main problem is that: - ligature lj is always lowercase - ligature Lj is uppercase, but only at the beginning of a word where other letters are lowercase (Ljubljana) - ligature LJ is uppercase, but only at the beginning of a word where other letters are uppercase (LJUBLJANA)
well, in utf it should be ok, since then we use info from the tables

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
participants (2): Hans Hagen, Mojca Miklavec