Index sorting for other languages than English
Hello Hans,
after an upgrade I noticed that the index sorting works even worse than before (tested on Czech, Chinese and Japanese, but probably related to non-ASCII characters in general).
With TeXExec 5.4.3, all words beginning with national (accented) characters were put into a separate ("symbols") group and placed before "A". This was not good but more or less acceptable. With TeXExec 6.2.0, words beginning with accented characters are placed under some unaccented letter. My colleague found out that these words are sorted according to their first unaccented letter. This is unacceptable and unusable.
As a workaround, we try to avoid indexing words beginning with accented characters, but that is impossible in many cases. I'd like to ask you to improve the index sorting. Could I help or contribute in some way?
Attached is a test file which creates 2 indexes from various Czech words (covering the Czech alphabet). The index should be sorted in exactly the order in which the terms are written in the file.
Thanks, Richard
John Culleton wrote, in reply to Richard Gabriel's message of Tuesday 23 May 2006 06:22:
Try Xindy. It has facilities for sorting according to arbitrary alphabetic orders, including Czech. It fits into the workflow much as makeindex does, and perhaps it could be adapted to a ConTeXt run.
-- John Culleton
Hans Hagen wrote, in reply to Richard Gabriel:
actually the new texexec implementation does Czech sorting, but it's not enabled yet in ConTeXt itself (it was experimental until now)
- download the latest version (i uploaded a version that enables it)
- don't forget \mainlanguage[cz] at the top of your document
- in sort-lan.tex you can see how Czech sorting is defined (ConTeXt adds a lot of info to the tui file in order to get sorting done)
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE | Ridderstraat 27 | 8061 GH Hasselt | The Netherlands | tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
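Hans's second point can be sketched as a minimal test document. This is only a sketch under assumptions: the indexed words are made up for illustration and are not taken from Richard's attached file, the accented letters are written with ConTeXt's named glyph commands (\ccaron, \aacute), and it presumes a ConTeXt and texexec version in which the Czech sorting is already enabled.

```tex
% Minimal sketch, not from the thread: the word list is hypothetical.
\mainlanguage[cz]   % as Hans notes, set this at the top of the document

\starttext

% Each word is both typeset and recorded in the default index register.
\index{cesta}cesta,
\index{\ccaron aj}\ccaron aj,
\index{chata}chata,
\index{\aacute rie}\aacute rie

% Typeset the sorted index.
\placeindex

\stoptext
```

In Czech alphabetical order, words in "ch" follow those in "h", and words in "č" follow those in "c", so a correctly sorted index would not list chata between cesta and čaj. Whether a given installation produces that order depends on the version in use.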
Richard Gabriel wrote, in reply to Hans Hagen's message of Tue, 23 May 2006 17:02:53 +0200:
Thanks Hans, it works with my test file, unless I set up:
\setupregister[index][expansion=xml]
which I need for correct processing of the XML files. If I simply add this command to the test TeX file (no XML), the Czech sorting stops working and all accented characters are placed under "A".
Regarding the sorting itself (sort-lan.tex): I found the definition of the sorting quite strange, or rather incomplete. It makes no sense to separate ccaron while all other accented letters are placed under the unaccented ones. I'll update the definitions, test them and send them to you.
-Richard
Hans Hagen wrote, in reply to Richard Gabriel:
On the problem with \setupregister[index][expansion=xml]: test file ...
On the updated definitions for sort-lan.tex: ok
Hans
Richard Gabriel wrote, in reply to Hans Hagen's message of Wed, 24 May 2006 17:55:02 +0200:
Here is the test file. If you remove the \setupregister command, or simply set expansion=no, the sorting will work perfectly. With expansion=yes or expansion=xml, the accented letters are sorted under "A".
Below are my updated sorting rules again...
-Richard
---
\def\czsortdivisionch{ch}
\def\czsortdivisionCh{Ch}
\startmode[sortorder-cz]
\exportsortexpansion{aacute}{a+1}
\exportsortexpansion{Aacute}{A+1}
\exportsortexpansion{ccaron}{c+1}
\exportsortexpansion{Ccaron}{C+1}
\exportsortdivision{c+1}{ccaron}
\exportsortexpansion{dcaron}{d+1}
\exportsortexpansion{Dcaron}{C+1}
\exportsortdivision{d+1}{dcaron}
\exportsortexpansion{eacute}{e+1}
\exportsortexpansion{Eacute}{E+1}
\exportsortexpansion{ecaron}{e+2}
\exportsortexpansion{Ecaron}{E+2}
\exportsortreduction{ch}{h+1}
\exportsortexpansion{ch}{h+1}
\exportsortreduction{Ch}{h+1}
\exportsortexpansion{Ch}{h+1}
\exportsortdivision{h+1}{czsortdivisionch}
\exportsortexpansion{iacute}{i+1}
\exportsortexpansion{Iacute}{I+1}
\exportsortexpansion{ncaron}{n+1}
\exportsortexpansion{Ncaron}{n+1}
\exportsortdivision{n+1}{ncaron}
\exportsortexpansion{oacute}{o+1}
\exportsortexpansion{Oacute}{O+1}
\exportsortexpansion{rcaron}{r+1}
\exportsortexpansion{Rcaron}{R+1}
\exportsortdivision{r+1}{rcaron}
\exportsortexpansion{scaron}{s+1}
\exportsortexpansion{Scaron}{S+1}
\exportsortdivision{s+1}{scaron}
\exportsortexpansion{tcaron}{t+1}
\exportsortexpansion{Tcaron}{T+1}
\exportsortdivision{t+1}{tcaron}
\exportsortexpansion{uacute}{u+1}
\exportsortexpansion{Uacute}{U+1}
\exportsortexpansion{uring}{u+2}
\exportsortexpansion{Uring}{U+2}
\exportsortexpansion{yacute}{y+1}
\exportsortexpansion{Yacute}{Y+1}
\exportsortexpansion{zcaron}{z+1}
\exportsortexpansion{Zcaron}{Z+1}
\exportsortdivision{z+1}{zcaron}
\stopmode
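The expansion problem Richard reports can be isolated with a reduced test document along the following lines. This is a sketch, not Richard's attached file: the indexed words are hypothetical, and the behaviour noted in the comments is only what is reported in this thread for the versions discussed, not something verified independently.

```tex
% Reduced test sketch; the indexed words are hypothetical.
\mainlanguage[cz]

% Behaviour as reported in the thread (not re-verified here):
%   expansion=no         -> Czech sorting works
%   expansion=yes or xml -> accented letters are sorted under "A"
% Change the value below and rerun to compare the resulting indexes.
\setupregister[index][expansion=no]

\starttext

\index{zebra}zebra,
\index{\zcaron\aacute ba}\zcaron\aacute ba,
\index{\ccaron aj}\ccaron aj

\placeindex

\stoptext
```

Keeping the document this small makes it easier to see whether a difference in the index comes from the expansion setting alone.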
Hi all,
I have a document in Dutch (\mainlanguage[nl]) in which I quote Turkish items, which I want to collect in a separate index, like this:
"Enkele voorbeelden zijn: \quote{oudere zus} \turkish{abla}, \quote{jongere broer of zus} \turkish{karde\c{s}}, de \quote{zus van vader} (\quote{tante}) \turkish{hala}, \quote{de zus van moeder} \turkish{teyze}. Voor aangetrouwde familieleden gelden soms juist vagere termen dan in het Nederlands, bijv. \quote{aangetrouwde tante} en \quote{schoonzuster}, \turkish{yenge}."
The index, however, is based on Dutch (the main language). This causes two problems:
1. Words with accents, like s\"oz, are not sorted correctly according to any standard:
S
söz kesmek 76
saygı 14
şeref 3, 14, 24, 27
2. Letters with diacritics, like \c{s} (under which \c{s}eref is to be placed), are not included in the alphabetical listing in the index, which of course follows the Dutch alphabet.
Does anyone have a solution?
Regards, Robert
Richard Gabriel wrote, in reply to R. Ermers's message of Tue, 30 May 2006 08:43:01 +0200 (subject: Index sorting for other languages than English (2)):
I'd suggest using the extended variant of the \index macro. There you can specify an ASCII equivalent of the word, which will be used for sorting:
\index[soz kesmek]{s\"oz kesmek}
\index[seref]{\c seref}
-Richard
Hans Hagen wrote, in reply to Richard Gabriel's suggestion to give \index an ASCII sort key:
actually, supporting multiple indexes with their own sort order is kind of prepared but never completed, so i'll have a look at it
Hans
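Until per-register sort orders are available, Richard's suggestion might be combined with a separate register roughly as follows. This is a sketch under assumptions: "trindex" is a hypothetical register name, and it assumes that a register created with \defineregister accepts the same optional sort key as \index.

```tex
% Sketch only: "trindex" is a hypothetical register name.
\mainlanguage[nl]
\defineregister[trindex]

\starttext

% The bracketed ASCII key is used for sorting, the braced text is printed
% in the index. The register command itself typesets nothing, so the word
% is also written in the running text.
karde\c{s}\trindex[kardes]{karde\c{s}},
s\"oz kesmek\trindex[soz kesmek]{s\"oz kesmek},
\c{s}eref\trindex[seref]{\c{s}eref}

\placeregister[trindex]

\stoptext
```

Note the limitation: with plain ASCII keys, ş is sorted together with s and ö together with o, whereas the Turkish alphabet treats them as separate letters that follow s and o. The keys therefore give a stable order, not a Turkish one.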
Hans Hagen wrote, in reply to R. Ermers:
hm, so we need a mixed sorting mechanism (in sort-lan.tex you can define a sort order for Turkish but it still concerns the whole doc then)
Hans
Hello Hans,
I'm sorry but when you were adding my sorting rules for Czech, you've (probably by accident) deleted the definition of \czsortdivisionch which leads to errors when trying to sort a word on "ch". I've also made some minor corrections. Here is the updated version:
% ---
\def\czsortdivisionch{ch}
\def\czsortdivisionCh{Ch}
\startmode[sortorder-cz]
\exportsortexpansion {aacute} {a+1}
\exportsortexpansion {Aacute} {A+1}
\exportsortexpansion {ccaron} {c+1}
\exportsortexpansion {Ccaron} {C+1}
\exportsortdivision {c+1} {ccaron}
\exportsortexpansion {dcaron} {d+1}
\exportsortexpansion {Dcaron} {D+1}
\exportsortdivision {d+1} {dcaron}
\exportsortexpansion {eacute} {e+1}
\exportsortexpansion {Eacute} {E+1}
\exportsortexpansion {ecaron} {e+2}
\exportsortexpansion {Ecaron} {E+2}
\exportsortreduction {ch} {h+1}
\exportsortexpansion {ch} {h+1}
\exportsortreduction {Ch} {H+1}
\exportsortexpansion {Ch} {H+1}
\exportsortdivision {h+1} {czsortdivisionch}
\exportsortexpansion {iacute} {i+1}
\exportsortexpansion {Iacute} {I+1}
\exportsortexpansion {ncaron} {n+1}
\exportsortexpansion {Ncaron} {N+1}
\exportsortdivision {n+1} {ncaron}
\exportsortexpansion {oacute} {o+1}
\exportsortexpansion {Oacute} {O+1}
\exportsortexpansion {rcaron} {r+1}
\exportsortexpansion {Rcaron} {R+1}
\exportsortdivision {r+1} {rcaron}
\exportsortexpansion {scaron} {s+1}
\exportsortexpansion {Scaron} {S+1}
\exportsortdivision {s+1} {scaron}
\exportsortexpansion {tcaron} {t+1}
\exportsortexpansion {Tcaron} {T+1}
\exportsortdivision {t+1} {tcaron}
\exportsortexpansion {uacute} {u+1}
\exportsortexpansion {Uacute} {U+1}
\exportsortexpansion {uring} {u+2}
\exportsortexpansion {Uring} {U+2}
\exportsortexpansion {yacute} {y+1}
\exportsortexpansion {Yacute} {Y+1}
\exportsortexpansion {zcaron} {z+1}
\exportsortexpansion {Zcaron} {Z+1}
\exportsortdivision {z+1} {zcaron}
\stopmode
% ---
Hans Hagen wrote, in reply to Richard Gabriel:
i've (a bit more) finished texutil sorting so that it can also sort different registers according to their own language; i'll post an alpha (generating now) that can do:
\defineregister[one]
\defineregister[two]
\setupregister[two][language=cz]
\starttext
test \one{one} test \one{two} test \one {\aacute} test \one{alpha} test \one{chow}
test \two{one} test \two{two} test \two {\aacute} test \two{alpha} test \two{chow}
\blank[3*big]
\placeregister[one]
\blank[3*big]
\placeregister[two]
\stoptext
Hans
participants (4)
- Hans Hagen
- John R. Culleton
- R. Ermers
- Richard Gabriel