[NTG-context] Ligature suppression word list

rha17 at t-online.de
Wed Apr 7 20:19:52 CEST 2021


> Message: 2
> Date: Tue, 6 Apr 2021 15:03:54 +0000
> From: <denis.maier at ub.unibe.ch>
> To: <j.hagen at xs4all.nl>, <ntg-context at ntg.nl>
> Subject: Re: [NTG-context] Ligature suppression word list
> Message-ID: <41e6530172b54bffb7a82febff0a6be5 at ub.unibe.ch>
> Content-Type: text/plain; charset="iso-8859-1"
> 
>> -----Original Message-----
>> From: Hans Hagen <j.hagen at xs4all.nl>
>> Sent: Saturday, 3 April 2021 17:58
>> To: mailing list for ConTeXt users <ntg-context at ntg.nl>; Maier, Denis
>> Christian (UB) <denis.maier at ub.unibe.ch>
>> Subject: Re: [NTG-context] Ligature suppression word list

[…]

>> 
>>> 2. A bigger solution might be to use selnolig's patterns in a script
>>>    that can be run over a large corpus, such as the DWDS (Digitales
>>>    Wörterbuch der deutschen Sprache). That should give us a more
>>>    complete list of words where ligatures must be suppressed.
>> 
>> where is that DWDS ... i can write some code to deal with it (i'd rather start
>> from the source than from some interpretation; who knows what more there
>> is to uncover)
> 
> As it turns out, the linguists who helped with the selnolig package used a different corpus: the Stuttgart "Deutsch" Web as Corpus.
> They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnolig-check-documentation.pdf

Many corpora can be found here: https://wortschatz.uni-leipzig.de/de
and especially here: https://wortschatz.uni-leipzig.de/de/download/German

There are corpora of many other languages, too, such as English, French, Dutch, Spanish, Russian, Japanese, Latin, …
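
The idea quoted above (running suppression patterns over a corpus word list) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the rule notation ("|" marks where a ligature should be broken) and the two sample rules are made up for this example and are not taken from selnolig, and the words are given as a plain list rather than read from any particular corpus file format.

```python
# Hypothetical rules, NOT selnolig data: the "|" marks the position
# where a ligature (here fl or fi) should be broken.
RULES = ["auf|l", "schilf|i"]


def mark_breaks(word, rules):
    """Return the word with "|" inserted at every break position,
    or None if no rule matches."""
    lower = word.lower()  # match case-insensitively
    breaks = set()
    for rule in rules:
        plain = rule.replace("|", "")  # the substring to look for
        offset = rule.index("|")       # break position inside that substring
        start = lower.find(plain)
        while start != -1:
            breaks.add(start + offset)
            start = lower.find(plain, start + 1)
    if not breaks:
        return None
    return "".join("|" + ch if i in breaks else ch
                   for i, ch in enumerate(word))


def scan(words, rules):
    """Collect all words that need at least one ligature break."""
    result = {}
    for word in words:
        marked = mark_breaks(word, rules)
        if marked is not None:
            result[word] = marked
    return result


if __name__ == "__main__":
    for word, marked in scan(["Auflage", "Kaufleute", "Flasche"], RULES).items():
        print(word, "->", marked)
```

Plain substring rules like these will produce false positives and miss cases, so the resulting list would still need to be checked by hand.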

HTH

Ralf
