[…]
2. A bigger solution might be to use selnolig's patterns in a script
that can be run over a large corpus, such as the DWDS (Digitales
Wörterbuch der deutschen Sprache). That should give us a more
complete list of words where ligatures must be suppressed.
Where is that DWDS? I can write some code to deal with it (I'd rather start
from the source than from some interpretation; who knows what more there
is to uncover).
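A rough sketch of what such a scan could look like, assuming the corpus is available as plain text. The rule pairs and the names `RULES`, `tokenize` and `find_suppressions` are made up for illustration; they are not selnolig's actual rule set or syntax, and plain substring matching like this will produce false positives that need manual review.

```python
import re

# Hypothetical, simplified rules: (substring to look for,
# the same substring with "|" marking where the ligature is broken).
# These are NOT taken from selnolig; they only illustrate the idea.
RULES = [
    ("aufl", "auf|l"),  # e.g. Auf|lage, Kauf|leute
    ("lfi", "lf|i"),    # e.g. Schilf|insel
]


def tokenize(text):
    """Split text into words made of letters only (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text)


def find_suppressions(words, rules=RULES):
    """Map each matching word to the list of break patterns that hit it."""
    hits = {}
    for word in words:
        lower = word.lower()
        for needle, split in rules:
            if needle in lower:
                hits.setdefault(word, []).append(split)
    return hits


if __name__ == "__main__":
    sample = "Die Auflage des Buches über Kaufleute und die Schilfinsel."
    for word, splits in sorted(find_suppressions(tokenize(sample)).items()):
        print(word, splits)
```

For a real corpus one would read the text file by file and count how often each word occurs, so that the most frequent candidates can be checked first.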
As it turns out, the linguists who helped with the selnolig package used a different corpus: the Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper:
https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf
There are corpora of many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …
HTH
Ralf