[NTG-context] Ligature suppression word list

Hans Hagen j.hagen at xs4all.nl
Sat Apr 3 17:58:26 CEST 2021

On 4/3/2021 5:06 PM, denis.maier at ub.unibe.ch wrote:
> Hi everyone
> Now that Hans has implemented the new ligature suppression mechanism via 
> language goodies – thanks again Hans! – we now need to come up with 
> wordlists.
> I’ve started working on a list of German words with ligatures that 
> should be suppressed. The list is derived from the word list that comes 
> with the lualatex selnolig package: 
> https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist.tex 
> You can find the current list here:
> https://github.com/denismaier/context-nolig-wordlist
> The list is currently organized as follows:
>  1. L. 25-35: This block specifies words where automatic pattern
>     matching is more difficult than usual because the words contain
>     multiple ligatures, some of which must be suppressed while others
>     must be preserved. In the case of « Auflagefläche » it’s even the
>     same combination of letters. So here, we use the bar | to manually
>     indicate points where no ligature must occur.
>  2. L. 36ff.: The vast majority of words are currently in this block,
>     which specifies words where a ff, fl, fi, ffi, or ffl ligature has
>     to be broken up after the first f.
>  3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be
>     prevented after the second f, so that the first two fs form a
>     ligature.
>  4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157,
>     l. 2225, and l. 2277, suppress ligatures for « ft » and « fft »,
>     « fb » and « ffb », « fh » and « ffh », « fj » and « ffj », and
>     « fk » and « ffk ».
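
To make the bar notation concrete, here is a minimal Python sketch of how a bar-marked entry could be read. It assumes an entry such as "Auf|lagefläche" (the exact spelling in the list may differ), and the names parse_entry, ligature_spans and LIGATURES are made up for this illustration; this is not how ConTeXt processes the list internally.

```python
# Candidate ligatures to look for; an illustrative set, not a complete one.
LIGATURES = ("ff", "fi", "fl", "ffi", "ffl")


def parse_entry(entry):
    """Split a bar-marked entry into the plain word and its break positions.

    A position p means: no ligature may span the boundary between
    word[p - 1] and word[p].
    """
    parts = entry.split("|")
    word = "".join(parts)
    positions = []
    offset = 0
    for part in parts[:-1]:
        offset += len(part)
        positions.append(offset)
    return word, positions


def ligature_spans(word, positions):
    """List (start, ligature, allowed) for every candidate found in word."""
    found = []
    for lig in LIGATURES:
        start = word.find(lig)
        while start != -1:
            end = start + len(lig)
            # The candidate is blocked if a break position falls inside it.
            blocked = any(start < p < end for p in positions)
            found.append((start, lig, not blocked))
            start = word.find(lig, start + 1)
    return sorted(found)


word, positions = parse_entry("Auf|lagefläche")
for start, lig, allowed in ligature_spans(word, positions):
    print(start, lig, "keep" if allowed else "suppress")
```

For this entry the first fl (Auf + lage) is reported as suppressed and the second one (fläche) as kept, which is the case described in item 1.
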
> Obviously, that list is far from complete, and the question is whether 
> it ever can be. Please have a look and feel free to propose more words 
> to be included – either via mail or directly on GitHub.
> More generally, there’s the question of how such a list should be 
> enhanced. I was thinking about two options:
>  1. The new language options features include a tracker that shows
>     for which words in a given document ligature prevention happened,
>     and which words haven’t been touched by the mechanism. It should
>     be possible to analyze the log file and to create lists of words
>     with ligatures. From there it should be a rather simple step to
>     derive new words for the ligature-suppression wordlist.
>  2. A bigger solution might be to use selnolig’s patterns in a script
>     that can be run over a large corpus, such as the DWDS (Digitales
>     Wörterbuch der deutschen Sprache). That should give us a more
>     complete list of words where ligatures must be suppressed.
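
As a rough sketch of the second option, the following Python script applies a few suppression rules to a list of words and prints the ones that need a break, in bar notation. The rules and the keep exception below are made up for illustration and are not selnolig's actual patterns; the names NOLIG_RULES, KEEP_RULES, mark_word and scan are likewise hypothetical. How the words are obtained from a corpus such as the DWDS is left open; the script only assumes some iterable of words.

```python
import re

# Illustrative rules only: (plain pattern, same pattern with a bar at the break).
NOLIG_RULES = [
    ("aufl", "auf|l"),
    ("auff", "auf|f"),
    ("kopfl", "kopf|l"),
]

# Illustrative exceptions: inside these strings the ligature is kept,
# e.g. "Taufliege" is Tau + Fliege, so its fl must survive.
KEEP_RULES = ["taufl"]


def mark_word(word):
    """Return word with a bar inserted wherever a rule suppresses a ligature."""
    lowered = word.lower()  # assumes lowercasing keeps the length (true for German)
    protected = set()
    for keep in KEEP_RULES:
        for m in re.finditer(re.escape(keep), lowered):
            # Every boundary inside a keep match is protected from breaking.
            protected.update(range(m.start() + 1, m.end()))
    breaks = set()
    for plain, marked in NOLIG_RULES:
        bar = marked.index("|")
        for m in re.finditer(re.escape(plain), lowered):
            pos = m.start() + bar
            if pos not in protected:
                breaks.add(pos)
    # Insert bars from right to left so earlier positions stay valid.
    for pos in sorted(breaks, reverse=True):
        word = word[:pos] + "|" + word[pos:]
    return word


def scan(words):
    """Yield the bar-marked form of every word that at least one rule touches."""
    for w in words:
        marked = mark_word(w)
        if marked != w:
            yield marked


if __name__ == "__main__":
    corpus = ["Auflage", "Taufliege", "Kaufleute", "Haus"]
    for entry in scan(corpus):
        print(entry)
```

With the sample words, only "Auf|lage" and "Kauf|leute" are printed; "Taufliege" is protected by the keep rule and "Haus" matches no rule. A real run would need the full pattern set plus a manual review of the output, since patterns alone cannot decide every compound boundary.
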

where is that DWDS ... i can write some code to deal with it (i'd rather 
start from the source than from some interpretation; who knows what more 
there is to uncover)

additional info: we're talking about a mechanism that is sort of 
integrated in the hyphenation loop, where we can also handle compound 
words (if needed with details about how to influence the hyphenation of 
these), so the above question involves:

- exceptions to exceptions
- replacements before hyphenation
- compound words (including lhmin/rhmin overloads)
- (left, right, two-sided) ligature and/or kern prevention

and whatever we like/need more (within reasonable bounds),


                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
