[NTG-context] Ligature suppression word list
denis.maier at ub.unibe.ch
Tue Apr 6 16:59:59 CEST 2021
> -----Original Message-----
> From: Hans Hagen <j.hagen at xs4all.nl>
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users <ntg-context at ntg.nl>; Maier, Denis
> Christian (UB) <denis.maier at ub.unibe.ch>
> Subject: Re: [NTG-context] Ligature suppression word list
> On 4/3/2021 5:06 PM, denis.maier at ub.unibe.ch wrote:
> > Hi everyone
> > Now that Hans has implemented the new ligature suppression mechanism
> > via language goodies - thanks again Hans! - we now need to come up
> > with wordlists.
> > I've started working on a list of German words with ligatures that
> > should be suppressed. The list is derived from the word list that
> > comes with the lualatex selnolig package:
> > https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex
> > You can find the current list here:
> > https://github.com/denismaier/context-nolig-wordlist
> > The list is currently organized as follows:
> > 1. L. 25-35: These are words where automatic pattern matching is
> > harder than usual because they contain multiple ligatures, some of
> > which must be suppressed while others must be preserved. In
> > « Auflagefläche » it is even the same combination of letters. So
> > here we use the bar | to manually mark the points where no ligature
> > may occur.
> > 2. L. 36ff.: The vast majority of the words are currently in this
> > block, which lists words where an ff, fl, fi, ffi, or ffl ligature
> > has to be broken up after the first f.
> > 3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be
> > prevented after the second f, so that the first two fs still form a
> > ligature.
> > 4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157,
> > l. 2225, and l. 2277, suppress ligatures for « ft » and « fft »,
> > « fb » and « ffb », « fh » and « ffh », « fj » and « ffj », and
> > « fk » and « ffk ».
> > Obviously, that list is far from being complete, and the question is
> > if it ever can be. Please have a look and feel free to propose more
> > words to be included - either via mail or directly on github.
> > More generally, there's the question of how such a list should be enhanced.
> > I was thinking about two options:
> > 1. The new language options feature includes a tracker that reports
> > for which words in a given document ligature prevention happened,
> > and which words haven't been touched by the mechanism. It should be
> > possible to analyze the log file and create lists of words with
> > ligatures. Deriving new words for the ligature-suppression wordlist
> > from that should be a rather simple step.
> > 2. A bigger solution might be to use selnolig's patterns in a script
> > that can be run over a large corpus, such as the DWDS (Digitales
> > Wörterbuch der deutschen Sprache). That should give us a more
> > complete list of words where ligatures must be suppressed.
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)
The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...
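
As a rough starting point for option 2, something like the following Python sketch could pull ligature candidates out of plain text and mark the break after a known prefix. This is not selnolig's pattern set: the function names and the two-entry prefix list are made up for illustration, and a prefix heuristic like this will produce false positives, so the output would still need manual review.

```python
import re

# Letter sequences mentioned in the list description; longer ones first.
LIGATURES = ("ffi", "ffl", "fff", "fft", "ffb", "ffh", "ffj", "ffk",
             "ff", "fi", "fl", "ft", "fb", "fh", "fj", "fk")

# Hypothetical sample of prefixes ending in "f"; a usable list would be far longer.
PREFIXES = ("auf", "tief")

# Runs of letters, including umlauts and ß (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def ligature_candidates(text):
    """Return the sorted set of words in text that contain a ligature sequence."""
    words = set(WORD_RE.findall(text))
    return sorted(w for w in words if any(lig in w.lower() for lig in LIGATURES))


def mark_break(word, prefixes=PREFIXES):
    """Insert "|" after a known prefix when a ligature would straddle that boundary."""
    lower = word.lower()
    for prefix in prefixes:
        cut = len(prefix)
        if not lower.startswith(prefix) or len(lower) == cut:
            continue
        for lig in LIGATURES:
            # Only look at occurrences that can still overlap the boundary.
            start = lower.find(lig, max(0, cut - len(lig) + 1))
            if 0 <= start < cut < start + len(lig):
                return word[:cut] + "|" + word[cut:]
    return word


if __name__ == "__main__":
    sample = "Die Auflagefläche ist tief, das Schiff liegt auf Grund."
    for w in ligature_candidates(sample):
        print(w, "->", mark_break(w))
```

Only the ligature that straddles the prefix boundary gets a bar, so in « Auflagefläche » the first fl is broken and the second one is left alone, which is the case from block 1 of the list.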