-----Original Message-----
From: Hans Hagen
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users; Maier, Denis Christian (UB)
Subject: Re: [NTG-context] Ligature suppression word list

On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
Hi everyone
Now that Hans has implemented the new ligature suppression mechanism via language goodies - thanks again Hans! - we need to come up with word lists.
I've started working on a list of German words with ligatures that should be suppressed. The list is derived from the word list that comes with the LuaLaTeX selnolig package: https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex
You can find the current list here: https://github.com/denismaier/context-nolig-wordlist
The list is currently organized as follows:
1. L. 25-35: Words where automatic pattern matching is more difficult than usual because they contain multiple ligatures, some of which must be suppressed while others must be preserved. In the case of «Auflagefläche» it's even the same combination of letters. So here we use the bar | to manually mark the points where no ligature must occur.
2. L. 36ff.: The vast majority of words currently sit in this block, which lists words where an ff, fl, fi, ffi, or ffl ligature has to be broken up after the first f.
3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be prevented after the second f, so that the first two f's still form a ligature.
4. The remaining blocks, starting at l. 1900, 2073, 2157, 2225, and 2277, suppress ligatures for «ft» and «fft», «fb» and «ffb», «fh» and «ffh», «fj» and «ffj», and «fk» and «ffk».
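To make the bar notation from item 1 concrete: an entry such as Auf|lagefläche can be read as the plain word plus the offsets at which no ligature may form. A minimal Python sketch, assuming entries are plain words with | marking each break; the function name parse_entry is made up for illustration and is not part of ConTeXt or selnolig:

```python
def parse_entry(entry):
    """Split a bar-marked entry into the plain word and its break offsets.

    Each offset is the index, in the plain word, of the character that
    follows a bar: no ligature may span offset-1 and offset.
    """
    word_chars = []
    breaks = []
    for ch in entry:
        if ch == "|":
            # The bar itself is not part of the word; record where it sat.
            breaks.append(len(word_chars))
        else:
            word_chars.append(ch)
    return "".join(word_chars), breaks


print(parse_entry("Auf|lagefläche"))
```

For «Auflagefläche» this yields a single break after "Auf", so the first fl is split while the fl in "fläche" stays eligible for a ligature.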
Obviously, that list is far from complete, and the question is whether it ever can be. Please have a look and feel free to propose more words to be included - either via mail or directly on GitHub.
More generally, there's the question of how such a list should be extended. I was thinking about two options:
1. The new language options feature includes a tracker that shows for which words in a given document ligature prevention happened, and which words haven't been touched by the mechanism. It should be possible to analyze the log file and create lists of words with ligatures. From there it should be a rather simple step to derive new words for the ligature-suppression word list.
2. A bigger solution might be to use selnolig's patterns in a script that can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der deutschen Sprache). That should give us a more complete list of words where ligatures must be suppressed.
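Option 2 could be prototyped roughly as follows. This is only a sketch under assumptions: the rules are simplified literal pairs (fragment, fragment with | at the break) in the spirit of selnolig's patterns, not the package's actual pattern matching; the three rules and the sample words are purely illustrative; and how to get the words out of the DWDS is left open.

```python
# Hypothetical rules: (fragment to look for, same fragment with | at the break).
RULES = [
    ("auflag", "auf|lag"),
    ("kaufleut", "kauf|leut"),
    ("schilfinsel", "schilf|insel"),
]


def mark_breaks(word, rules=RULES):
    """Return the word with | inserted where a rule matches, else None."""
    # lower() keeps the length of German words unchanged (ß stays ß),
    # so offsets found in the lowered copy are valid in the original.
    lowered = word.lower()
    cuts = set()
    for fragment, marked in rules:
        offset = marked.index("|")
        start = lowered.find(fragment)
        while start != -1:
            cuts.add(start + offset)
            start = lowered.find(fragment, start + 1)
    if not cuts:
        return None
    return "".join(("|" if i in cuts else "") + ch for i, ch in enumerate(word))


def scan(words):
    """Yield bar-marked entries for every word that some rule touches."""
    for word in words:
        marked = mark_breaks(word)
        if marked is not None:
            yield marked


corpus = ["Auflagefläche", "Kaufleute", "Flasche", "Schilfinsel"]
print(list(scan(corpus)))
```

Words that no rule touches (here «Flasche») are simply skipped, so the output can be reviewed by hand before anything is added to the word list.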
where is that DWDS ... i can write some code to deal with it (i'd rather start from the source than from some interpretation; who knows what more there is to uncover)
The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...

Denis