Hi,

 

A small update on this one:

I’ve built a small Python script that uses the patterns from the selnolig package to extract words with suspicious ligatures from the word list provided by the Uni Leipzig corpus project. Running it over a corpus of more than 1 million words produces the attached list of about 790 words, so the result is not huge. I’ll need to check whether these are already in the goodies file or whether I need to add them.

 

Anyway, I was thinking about making such a script more generic. Think of something along the lines of:

pdftotext book.pdf - | showIncorrectLigatures.py > incorrect-ligatures.txt
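For the curious, here is a minimal sketch of what such a stdin filter could look like. The two patterns are only an illustrative stand-in for the full selnolig pattern set, the tokenization is deliberately naive, and the script name is just the placeholder from the pipeline above:

```python
#!/usr/bin/env python3
"""Sketch of showIncorrectLigatures.py: read text on stdin and print
each distinct word that matches a suspicious-ligature pattern.

The patterns below are a tiny illustrative subset; a real run would
use the full pattern set shipped with the selnolig package."""
import re
import sys

# Word fragments where a ligature (fl, fi, ...) would cross a German
# morpheme boundary and should therefore be suppressed. Illustrative only.
SUSPICIOUS = [
    re.compile(r"aufl"),  # Auf|lage, Kauf|leute
    re.compile(r"lfin"),  # Schilf|insel
]

def suspicious_words(text):
    """Yield each distinct word containing a suspicious pattern."""
    seen = set()
    for word in re.findall(r"\w+", text):
        lower = word.lower()
        if lower in seen:
            continue
        if any(p.search(lower) for p in SUSPICIOUS):
            seen.add(lower)
            yield word

if __name__ == "__main__":
    for word in suspicious_words(sys.stdin.read()):
        print(word)
```

Note that `pdftotext` needs the `-` argument to write to stdout rather than to `book.txt`.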

 

Denis

 

 

From: ntg-context <ntg-context-bounces@ntg.nl> On behalf of rha17@t-online.de
Sent: Wednesday, 7 April 2021 20:20
To: ntg-context@ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list

 



Date: Tue, 6 Apr 2021 15:03:54 +0000
From: <denis.maier@ub.unibe.ch>
To: <j.hagen@xs4all.nl>, <ntg-context@ntg.nl>
Subject: Re: [NTG-context] Ligature suppression word list


-----Original Message-----
From: Hans Hagen <j.hagen@xs4all.nl>
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users <ntg-context@ntg.nl>; Maier, Denis
Christian (UB) <denis.maier@ub.unibe.ch>
Subject: Re: [NTG-context] Ligature suppression word list

 

[…]





2. A bigger solution might be to use selnolig's patterns in a script
   that can be run over a large corpus, such as the DWDS (Digitales
   Wörterbuch der deutschen Sprache). That should give us a more
   complete list of words where ligatures must be suppressed.


where is that DWDS ... i can write some code to deal with it (i'd rather start
from the source than from some interpretation; who knows what more there
is to uncover)


As it turns out, the linguists who helped with the selnolig package used a different corpus: the Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

 

Many corpora can be found here: https://wortschatz.uni-leipzig.de/de

and German ones in particular here: https://wortschatz.uni-leipzig.de/de/download/German

 

There are corpora of many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …

 

HTH

 

Ralf