[NTG-context] Ligature suppression word list

denis.maier at ub.unibe.ch denis.maier at ub.unibe.ch
Sat Apr 3 17:06:22 CEST 2021

Hi everyone

Now that Hans has implemented the new ligature suppression mechanism via language goodies - thanks again, Hans! - we need to come up with word lists.

I've started working on a list of German words containing ligatures that should be suppressed. It is derived from the word list that comes with the LuaLaTeX selnolig package: https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex

You can find the current list here: https://github.com/denismaier/context-nolig-wordlist

The list is currently organized as follows:

  1.  L. 25-35: These lines list words where automatic pattern matching is more difficult than usual because the words contain multiple ligatures, some of which must be suppressed while others must be preserved. In the case of «Auflagefläche» it's even the same combination of letters: the first «fl» must be broken up, the second one kept. So here, we use the bar | to manually mark the points where no ligature must occur.
  2.  L. 36ff.: The vast majority of the words are currently in this block, which lists words where an ff, fl, fi, ffi, or ffl ligature has to be broken up after the first f.
  3.  L. 1804ff.: These lines contain words where ffi, ffl, or fff ligatures have to be prevented after the second f, so that the first two fs form a ligature.
  4.  The remaining blocks, starting at l. 1900, l. 2073, l. 2157, l. 2225, and l. 2277, suppress ligatures for «ft» and «fft», «fb» and «ffb», «fh» and «ffh», «fj» and «ffj», and «fk» and «ffk», respectively.
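To illustrate the bar notation from item 1, here is a minimal Python sketch of how such an entry could be read. It is not how ConTeXt processes the goodies file; the function names are made up for this example, the entry "Auf|lagefläche" is my assumption of how the word would be written in the list, and inserting a zero width non-joiner is only one way to visualize the break points.

```python
from typing import List, Tuple

ZWNJ = "\u200c"  # zero width non-joiner; used here for illustration only


def parse_entry(entry: str) -> Tuple[str, List[int]]:
    """Split a word-list entry into the plain word and the character
    offsets at which no ligature may form (hypothetical helper)."""
    plain = []
    positions = []
    for ch in entry:
        if ch == "|":
            # The break sits before the next real character.
            positions.append(len(plain))
        else:
            plain.append(ch)
    return "".join(plain), positions


def apply_breaks(word: str, positions: List[int]) -> str:
    """Insert a ZWNJ at every break offset (hypothetical helper)."""
    out = []
    for i, ch in enumerate(word):
        if i in positions:
            out.append(ZWNJ)
        out.append(ch)
    return "".join(out)


# Assumed entry: the first "fl" is broken, the second one is left alone.
word, breaks = parse_entry("Auf|lagefläche")
print(word, breaks)
```

Only the marked position is touched, so the «fl» in «fläche» can still form a ligature.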

Obviously, the list is far from complete, and it's an open question whether it ever can be. Please have a look and feel free to propose more words to be included - either by mail or directly on GitHub.

More generally, there's the question of how such a list should be extended. I was thinking about two options:

  1.  The new language features include a tracker that records, for a given document, the words in which ligature prevention happened and the words that the mechanism left untouched. It should be possible to analyze the log file and build a list of the words that contain ligatures. From there, deriving new entries for the ligature-suppression word list should be a rather simple step.
  2.  A bigger solution might be to use selnolig's patterns in a script that can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der deutschen Sprache). That should give us a more complete list of words where ligatures must be suppressed.
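Option 2 could look roughly like the Python sketch below. The single rule in it is a made-up placeholder (the prefix "auf" followed by f, l, or i) and not one of selnolig's actual patterns, and the short word list stands in for a real corpus such as the DWDS; the function name is hypothetical as well.

```python
import re

# Placeholder rule, NOT taken from selnolig: break after a word-initial
# "auf" when the next letter is f, l, or i.
RULES = [
    re.compile(r"^auf(?=[fli])", re.IGNORECASE),
]


def suggest_entries(words, rules=RULES):
    """Return candidate word-list entries, with "|" marking the point
    where a ligature should be suppressed (hypothetical helper)."""
    suggestions = []
    for word in words:
        marked = word
        for pattern in rules:
            # Keep the matched text and append the break marker.
            marked = pattern.sub(lambda m: m.group(0) + "|", marked)
        if marked != word:
            suggestions.append(marked)
    return suggestions


# Stand-in for a corpus; in practice this would be read from a file.
corpus = ["Auflage", "Aufführung", "Flasche", "Kaufleute"]
print(suggest_entries(corpus))
```

A word like «Kaufleute» is not caught by this one rule, which is exactly why the candidates from such a script would still need to be checked by hand before they go into the list.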

What do you think?
