[NTG-context] Ligature suppression word list

denis.maier at ub.unibe.ch denis.maier at ub.unibe.ch
Mon Apr 12 17:52:40 CEST 2021


A small update on this one:
I've built a small Python script that uses the patterns from the selnolig package to extract words with suspicious ligatures from the word list provided by the Uni Leipzig corpus project. Running it over a corpus of more than 1 million words produces the attached word list of about 790 words, so the result is not huge. I'll need to check whether these words are already in the goodies file or whether I need to add them.
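
The check against the goodies file comes down to a set difference. Here is a minimal sketch, assuming that both the new candidates and the words already covered are available as plain one-word-per-line lists. The function name and the sample words are hypothetical, and pulling the known words out of the goodies file itself is not shown.

```python
def words_to_add(candidates, known):
    """Return the candidates that are not in known, sorted.

    Comparison is case-insensitive (an assumption; drop casefold() if
    the target list distinguishes case).
    """
    known_folded = {w.strip().casefold() for w in known if w.strip()}
    missing = set()
    for word in candidates:
        word = word.strip()
        if word and word.casefold() not in known_folded:
            missing.add(word)
    return sorted(missing)


# Hypothetical sample data, not taken from the attached word list.
candidates = ["Auflage", "Kaufleute", "Schilfinsel"]
known = ["auflage"]
print(words_to_add(candidates, known))  # ['Kaufleute', 'Schilfinsel']
```

With real files, the two arguments could simply be open file objects, since iterating over a text file yields one line at a time.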

Anyway, I was thinking about making such a script more generic. Think of something along the lines of:
pdftotext book.pdf - | showIncorrectLigatures.py > incorrect-ligatures.txt
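
A minimal sketch of what such a filter could look like, assuming Python 3. The two regular expressions are illustrative stand-ins, not selnolig's actual rules, and the function names are hypothetical; a real script would load the full pattern set instead.

```python
#!/usr/bin/env python3
"""Print words from stdin that match ligature-suppression patterns."""
import re
import sys

# Illustrative stand-ins only: these are NOT selnolig's actual rules.
SUSPICIOUS_PATTERNS = [
    re.compile(r"^auf[fil]", re.IGNORECASE),  # prefix "auf" + f/i/l, e.g. "Auflage"
    re.compile(r"kauf[fl]", re.IGNORECASE),   # "kauf" + f/l, e.g. "Kaufleute"
]

# Runs of letters only (no digits or underscores); covers umlauts and ß.
WORD_RE = re.compile(r"[^\W\d_]+")


def find_suspicious_words(text):
    """Return the sorted, de-duplicated words in text matching any pattern."""
    hits = set()
    for word in WORD_RE.findall(text):
        if any(p.search(word) for p in SUSPICIOUS_PATTERNS):
            hits.add(word)
    return sorted(hits)


def main():
    # Read everything from stdin and print one suspicious word per line.
    for word in find_suspicious_words(sys.stdin.read()):
        print(word)


if __name__ == "__main__":
    main()
```

Reading from stdin and writing to stdout keeps the script independent of the input format, so any tool that can emit plain text could feed it.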


From: ntg-context <ntg-context-bounces at ntg.nl> On Behalf Of rha17 at t-online.de
Sent: Wednesday, 7 April 2021 20:20
To: ntg-context at ntg.nl
Subject: Re: [NTG-context] Ligature suppression word list

Message: 2
Date: Tue, 6 Apr 2021 15:03:54 +0000
From: <denis.maier at ub.unibe.ch>
To: <j.hagen at xs4all.nl>, <ntg-context at ntg.nl>
Subject: Re: [NTG-context] Ligature suppression word list
Message-ID: <41e6530172b54bffb7a82febff0a6be5 at ub.unibe.ch>
Content-Type: text/plain; charset="iso-8859-1"

-----Original Message-----
From: Hans Hagen <j.hagen at xs4all.nl>
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users <ntg-context at ntg.nl>; Maier, Denis Christian (UB) <denis.maier at ub.unibe.ch>
Subject: Re: [NTG-context] Ligature suppression word list


2. A bigger solution might be to use selnolig's patterns in a script
   that can be run over a large corpus, such as the DWDS (Digitales
   Wörterbuch der deutschen Sprache). That should give us a more
   complete list of words where ligatures must be suppressed.

where is that DWDS ... i can write some code to deal with it (i'd rather start
from the source than from some interpretation; who knows what more there
is to uncover)

As it turns out, the linguists who helped with the selnolig package used a different corpus: the Stuttgart "Deutsch" Web as Corpus.
They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

Many corpora can be found here: https://wortschatz.uni-leipzig.de/de
especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora of many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …



-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: word-list.txt
URL: <http://mailman.ntg.nl/pipermail/ntg-context/attachments/20210412/3a050e31/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: create-ligature-prevention-wordlist.txt
URL: <http://mailman.ntg.nl/pipermail/ntg-context/attachments/20210412/3a050e31/attachment-0003.txt>
