Hi,
A small update on this one:
I’ve built a small Python script that uses the patterns from the selnolig package to extract words with suspicious ligatures from the word list provided by the Uni Leipzig corpus project.
Running the script over a corpus of more than 1 million words produces the attached list of about 790 words, so the result is not huge.
I’ll need to check whether they are already in the goodies file or whether I need to add them.
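For illustration, the filtering step can be sketched roughly as follows. This is not the attached script: the three patterns are made-up stand-ins for the selnolig rules, the tab-separated layout (id, word, frequency) of the Leipzig word file is an assumption, and `known` is assumed to be a set of lowercased words exported from whatever list already exists.

```python
import re

# Illustrative stand-ins only, NOT the real selnolig rule set.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in ("aufl", "auffa", "stoffl")]


def is_suspicious(word):
    """True if any stand-in pattern occurs anywhere in the word."""
    return any(p.search(word) for p in PATTERNS)


def candidates(lines, known=frozenset()):
    """Yield suspicious words from 'id<TAB>word<TAB>frequency' lines.

    The column layout is an assumption about the word file.
    Words in `known` (lowercased) and case-insensitive repeats are skipped.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines that do not match the assumed layout
        word = fields[1]
        key = word.lower()
        if key in seen or key in known:
            continue
        if is_suspicious(word):
            seen.add(key)
            yield word
```

Passing an open file object as `lines` works as well, since the function only iterates over it line by line.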
Anyway, I was thinking about making such a script more generic. Think of something along the lines of:
pdftotext book.pdf - | showIncorrectLigatures.py > incorrect-ligatures.txt
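A minimal sketch of what such a filter could look like, assuming the text arrives on standard input. The patterns are again made-up placeholders rather than the selnolig rules. Since PDF text often contains precomposed ligature glyphs such as U+FB02, the sketch normalizes the input with NFKC first, which maps them back to plain letters.

```python
#!/usr/bin/env python3
import re
import sys
import unicodedata

# Placeholder patterns for illustration, NOT the real selnolig rule set.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in ("aufl", "auffa", "stoffl")]

# Runs of letters (including umlauts and ß), excluding digits and underscores.
WORD = re.compile(r"[^\W\d_]+")


def suspicious_words(text):
    """Return the sorted, de-duplicated words that match any pattern."""
    # NFKC turns ligature code points such as U+FB01/U+FB02 into "fi"/"fl".
    text = unicodedata.normalize("NFKC", text)
    found = set()
    for word in WORD.findall(text):
        if any(p.search(word) for p in PATTERNS):
            found.add(word)
    return sorted(found)


# Only read standard input when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for word in suspicious_words(sys.stdin.read()):
        print(word)
```

Note that pdftotext writes to a file unless `-` is given as the output name, so the PDF variant of the pipeline needs that extra argument.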
Denis
From: ntg-context <ntg-context-bounces@ntg.nl>
On behalf of rha17@t-online.de
Sent: Wednesday, 7 April 2021 20:20
To: ntg-context@ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list
Date: Tue, 6 Apr 2021 15:03:54 +0000
From: <denis.maier@ub.unibe.ch>
To: <j.hagen@xs4all.nl>, <ntg-context@ntg.nl>
Subject: Re: [NTG-context] Ligature suppression word list
-----Original Message-----
From: Hans Hagen <j.hagen@xs4all.nl>
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis Christian (UB) <denis.maier@ub.unibe.ch>
Subject: Re: [NTG-context] Ligature suppression word list
[…]
2. A bigger solution might be to use selnolig's patterns in a script
that can be run over a large corpus, such as the DWDS (Digitales
Wörterbuch der deutschen Sprache). That should give us a more
complete list of words where ligatures must be suppressed.
where is that DWDS ... i can write some code to deal with it (i'd rather start
from the source than from some interpretation; who knows what more there
is to uncover)
As it turns out, the linguists who helped with the selnolig package used a different corpus: the Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf
A lot of corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German
There are corpora of many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …
HTH
Ralf