[NTG-context] Ligature suppression word list

denis.maier at ub.unibe.ch denis.maier at ub.unibe.ch
Tue Apr 6 16:59:59 CEST 2021



> -----Original Message-----
> From: Hans Hagen <j.hagen at xs4all.nl>
> Sent: Saturday, 3 April 2021 17:58
> To: mailing list for ConTeXt users <ntg-context at ntg.nl>; Maier, Denis
> Christian (UB) <denis.maier at ub.unibe.ch>
> Subject: Re: [NTG-context] Ligature suppression word list
> 
> On 4/3/2021 5:06 PM, denis.maier at ub.unibe.ch wrote:
> > Hi everyone
> >
> > Now that Hans has implemented the new ligature suppression mechanism
> > via language goodies - thanks again Hans! - we now need to come up
> > with wordlists.
> >
> > I've started working on a list of German words with ligatures that
> > should be suppressed. The list is derived from the word list that
> > comes with the lualatex selnolig package:
> > https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex
> >
> > You can find the current list here:
> > https://github.com/denismaier/context-nolig-wordlist
> >
> > The list is currently organized as follows:
> >
> >  1. L. 25-35: These are words where automatic pattern matching is
> >     harder than usual because they contain multiple ligatures, some
> >     of which must be suppressed while others must be preserved. In
> >     the case of «Auflagefläche» it is even the same combination of
> >     letters. So here we use the bar | to mark manually the points
> >     where no ligature may occur.
> >  2. L. 36ff.: The vast majority of words are currently in this block,
> >     which lists words where an ff, fl, fi, ffi, or ffl ligature has to
> >     be broken up after the first f.
> >  3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be
> >     prevented after the second f, so that the first two f's form a
> >     ligature.
> >  4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157,
> >     l. 2225, and l. 2277, suppress ligatures for «ft» and «fft», «fb»
> >     and «ffb», «fh» and «ffh», «fj» and «ffj», and «fk» and «ffk».
> >
> > Obviously, that list is far from complete, and the question is
> > whether it ever can be. Please have a look and feel free to propose
> > more words to be included, either via mail or directly on GitHub.
> >
> > More generally, there's the question of how such a list should be
> > extended. I was thinking about two options:
> >
> >  1. The new language option features include a tracker that reports
> >     for which words in a given document ligature prevention happened
> >     and which words the mechanism left untouched. It should be
> >     possible to analyze the log file and build lists of words with
> >     ligatures from it. Deriving new words for the ligature-suppression
> >     word list from those should then be a rather simple step.
> >  2. A bigger solution might be to use selnolig's patterns in a script
> >     that can be run over a large corpus, such as the DWDS (Digitales
> >     Wörterbuch der deutschen Sprache). That should give us a more
> >     complete list of words where ligatures must be suppressed.
> 
> where is that DWDS ... i can write some code to deal with it (i'd rather start
> from the source than from some interpretation; who knows what more there
> is to uncover)

The DWDS is here: https://www.dwds.de/
But I still need to check how we can extract the words from there...
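
To make the idea a bit more concrete, here is a rough Python sketch of what
such a script could look like. It is only an illustration under several
assumptions: the function names parse_entry and unlisted_candidates are made
up, the word list is passed in as a plain Python list rather than read from
the repository's file, and a simple substring check stands in for selnolig's
real patterns, which are more selective. The result is therefore a list of
words for manual review, not a list of words where suppression is required.

```python
import re

# Letter pairs named in this thread as ligature candidates. Assumption:
# a plain substring check stands in for selnolig's real patterns.
# (ffi, ffl, fft, ... are covered because they all contain "ff".)
CANDIDATES = ("ff", "fi", "fl", "ft", "fb", "fh", "fj", "fk")


def parse_entry(entry):
    """Split a bar-marked entry such as 'Auf|lagefläche' into the plain
    word and the character offsets at which a ligature must be broken."""
    plain = []
    breaks = []
    for ch in entry:
        if ch == "|":
            # The break sits before the next letter of the plain word.
            breaks.append(len(plain))
        else:
            plain.append(ch)
    return "".join(plain), breaks


def unlisted_candidates(text, known_words):
    """Return the sorted words in `text` that contain a candidate letter
    pair but are not yet present in `known_words` (case-insensitive)."""
    known = {w.lower() for w in known_words}
    found = set()
    # [^\W\d_]+ matches runs of letters, including umlauts and ß.
    for word in re.findall(r"[^\W\d_]+", text):
        lower = word.lower()
        if lower not in known and any(c in lower for c in CANDIDATES):
            found.add(word)
    return sorted(found)


if __name__ == "__main__":
    word, breaks = parse_entry("Auf|lagefläche")
    print(word, breaks)
    sample = "Die Auflage und der Kaufleute Schiff"
    print(unlisted_candidates(sample, [word, "Auflage"]))
```

Note that a word like "Schiff" would show up in the output even though its
ff should stay a ligature, so somebody still has to go through the candidates
by hand (or the check has to be replaced by the actual selnolig patterns).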

Denis

