Re: [NTG-context] Ligature suppression word list
Message: 2
Date: Tue, 6 Apr 2021 15:03:54 +0000
From: denis.maier@ub.unibe.ch
To: j.hagen@xs4all.nl, ntg-context@ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list
Message-ID: <41e6530172b54bffb7a82febff0a6be5@ub.unibe.ch>

-----Original Message-----
From: Hans Hagen <j.hagen@xs4all.nl>
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis Christian (UB) <denis.maier@ub.unibe.ch>
Subject: Re: [NTG-context] Ligature suppression word list
[…]
2. A bigger solution might be to use selnolig's patterns in a script that can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der deutschen Sprache). That should give us a more complete list of words in which ligatures must be suppressed.
where is that DWDS ... i can write some code to deal with it (i'd rather start from the source than from some interpretation; who knows what more there is to uncover)
As it turns out, the linguists who helped with the selnolig package used another corpus: the Stuttgart "Deutsch" Web-as-Corpus. They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnoli...
A lot of corpora can be found at https://wortschatz.uni-leipzig.de/de, in particular at https://wortschatz.uni-leipzig.de/de/download/German. There are corpora for many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …

HTH
Ralf
From: ntg-context
Hi,
a small update on this one:
I’ve built a small Python script that uses the patterns from the selnolig package to extract words with suspicious ligatures from the word lists provided by the Uni Leipzig corpus project. Running the script over a corpus of over 1 million words produces the attached word list. The result is not huge: that corpus gives us about 790 words. I’ll need to check whether they are already in the goodies file or whether I need to add them.
Anyway, I was thinking about making such a script more generic. Think of something along the lines of:
pdftotext book.pdf | showIncorrectLigatures.py > incorrect-ligatures.txt
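A minimal sketch of such a filter: it reads whitespace-separated words from stdin and prints each word that matches a suppression pattern. The three patterns below are hypothetical samples in the spirit of selnolig's rules (the real package ships many more); a real script would load the full pattern set from selnolig.

```python
import re
import sys

# Hypothetical sample patterns: each regex marks a spot where a ligature
# (ff, fi, fl, ...) would span a morpheme boundary and should be suppressed.
SAMPLE_PATTERNS = [
    re.compile(r"auf(?=f[aeiloru])", re.IGNORECASE),  # e.g. "auffinden"
    re.compile(r"kauf(?=l)", re.IGNORECASE),          # e.g. "Kaufleute"
    re.compile(r"hof(?=f[aeiou])", re.IGNORECASE),    # e.g. "Hoffest"
]

def suspicious_words(words):
    """Yield the words that contain a boundary-crossing ligature."""
    for word in words:
        if any(p.search(word) for p in SAMPLE_PATTERNS):
            yield word

if __name__ == "__main__":
    # Read words from stdin, report each suspicious word once (in order).
    words = sys.stdin.read().split()
    for word in dict.fromkeys(suspicious_words(words)):
        print(word)
```

Piped after pdftotext as in the one-liner above, this would report the words in a PDF whose ligatures need checking.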
Denis
participants (2)
- denis.maier@ub.unibe.ch
- rha17@t-online.de