Hi everyone

Now that Hans has implemented the new ligature suppression mechanism via language goodies - thanks again Hans! - we need to come up with word lists.

I've started working on a list of German words with ligatures that should be suppressed. The list is derived from the word list that comes with the LuaLaTeX selnolig package: https://github.com/micoloretan/selnolig/blob/master/selnolig-german-wordlist...

You can find the current list here: https://github.com/denismaier/context-nolig-wordlist

The list is currently organized as follows:

1. L. 25-35: Words where automatic pattern matching is more difficult than usual because they contain multiple ligatures, some of which must be suppressed while others must be preserved. In the case of « Auflagefläche » it's even the same combination of letters. So here, we use the bar | to manually indicate points where no ligature must occur.
2. L. 36ff.: The vast majority of words are currently in this block, which lists words where an ff, fl, fi, ffi, or ffl ligature has to be broken up after the first f.
3. L. 1804ff.: Words where ffi, ffl, or fff ligatures have to be prevented after the second f, so that the first two fs form a ligature.
4. The remaining blocks, starting at l. 1900, l. 2073, l. 2157, l. 2225, and l. 2277, suppress ligatures for « ft » and « fft », « fb » and « ffb », « fh » and « ffh », « fj » and « ffj », and « fk » and « ffk ».

Obviously, that list is far from complete, and the question is whether it ever can be. Please have a look and feel free to propose more words to be included - either via mail or directly on GitHub.

More generally, there's the question of how such a list should be extended. I was thinking about two options:

1. The new language options feature includes a tracker that records for which words in a given document ligature prevention happened, and which words the mechanism left untouched. It should be possible to analyze the log file and to create lists of words with ligatures. Deriving new words for the ligature-suppression word list from that should be a rather simple step.
2. A bigger solution might be to use selnolig's patterns in a script that can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der deutschen Sprache). That should give us a more complete list of words where ligatures must be suppressed.

What do you think?

Best,
Denis
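As a rough illustration of the first option, here is a minimal Python sketch that scans running text for words containing one of the letter clusters named above and reports those that are not yet in the word list. It does not read the ConTeXt tracker log, whose format is not described in this thread; the function names `load_known` and `candidates` are made up for this example, and the word list is assumed to be plain whitespace-separated words with optional | markers.

```python
import re

# Letter clusters named in the message above; longer ones come first so
# that "ffi" is reported as one hit instead of "ff" followed by nothing.
CLUSTERS = ["ffi", "ffl", "fft", "ffb", "ffh", "ffj", "ffk", "fff",
            "ff", "fi", "fl", "ft", "fb", "fh", "fj", "fk"]
CLUSTER_RE = re.compile("|".join(CLUSTERS))
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, umlauts included


def load_known(lines):
    """Collect the listed words, ignoring '|' markers and case."""
    known = set()
    for line in lines:
        for token in line.split():
            known.add(token.replace("|", "").lower())
    return known


def candidates(text, known):
    """Map each unlisted word that contains a cluster to the clusters found."""
    found = {}
    for word in WORD_RE.findall(text):
        key = word.lower()
        if key in known:
            continue
        hits = CLUSTER_RE.findall(key)
        if hits:
            found[word] = hits
    return found


known = load_known(["Auf|lagefläche Auf|lageflächen"])
print(candidates("Die Auflagefläche liegt neben der Schilfinsel.", known))
# prints {'Schilfinsel': ['fi']}
```

The output is only a list of candidates: many words with these clusters should keep their ligature, so each hit still needs a human decision before it goes into the word list.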
On Sat, Apr 03, 2021 at 03:06:22PM +0000, denis.maier@ub.unibe.ch wrote:
What do you think?
I think you should collaborate with the group of volunteers working on German hyphenation and related topics. They have a mailing list (in German): https://lists.dante.de/mailman/listinfo/trennmuster which is quite active and where Mico Loretan, the author of selnolig, occasionally posts. I’m sure they’ll be happy to help with suggestions and collaborative efforts, even if all of the main contributors use LaTeX.

Arthur
On 4/3/2021 5:20 PM, Arthur Rosendahl wrote:
On Sat, Apr 03, 2021 at 03:06:22PM +0000, denis.maier@ub.unibe.ch wrote:
What do you think?
I think you should collaborate with the group of volunteers working on German hyphenation and related topics. They have a mailing list (in German): https://lists.dante.de/mailman/listinfo/trennmuster which is quite active and where Mico Loretan, the author of selnolig, occasionally posts. I’m sure they’ll be happy to help with suggestions and collaborative efforts, even if all of the main contributors use LaTeX.
german is just an example, dutch has some specific things, and i bet other languages have their demands, so my aim is some general mechanism (for which much is already in place btw) ... we're talking of what i tag as 'language goodies', just like we have 'font goodies' .. a plug-in system

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On Sat, Apr 03, 2021 at 06:02:10PM +0200, Hans Hagen wrote:
german is just an example, dutch has some specific things, and i bet other languages have their demands so my aim is some general mechanism
I appreciate that, but if you want to have data of sufficiently good quality to use this mechanism for individual languages, you need to invest a *lot* of time for each one of them. German is one of the very few languages I know of that has an active group of people working to produce that data, the “Trennmuster people”, as Mojca calls them ;-) Their word list supports many fine points of typography, even those that few programs can use, for example weighted hyphenation. Ligature prevention came in as a side project.

Dutch, by contrast, does not seem so well served: the OpenTaal group is dormant and no longer offers the hyphenated word list that was once available (that was already the case five years ago). The most relevant page I find: https://www.opentaal.org/projecten/woordafbreking is from 2009. There have apparently been recent updates by a single person (who incidentally sometimes contributes to the German hyphenation working group), but they’re rather generic.

Best,
Arthur
On 4/8/2021 9:37 PM, Arthur Rosendahl wrote:
Dutch, by contrast, does not seem so well served: the OpenTaal group is dormant and no longer offers the hyphenated word list that was once available (that was already the case five years ago). The most relevant page I find: https://www.opentaal.org/projecten/woordafbreking is from 2009. There have apparently been recent updates by a single person (who incidentally sometimes contributes to the German hyphenation working group), but they’re rather generic.
fwiw: They are active in collecting words (they also do stuff for open office). Dutch patterns don't change much because the hyphenation is syllable based and predictable enough, I think. There haven't been that many releases of dutch patterns.
Hans
On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
1. The new language options feature includes a tracker that records for which words in a given document ligature prevention happened, and which words the mechanism left untouched. It should be possible to analyze the log file and to create lists of words with ligatures. Deriving new words for the ligature-suppression word list from that should be a rather simple step.
2. A bigger solution might be to use selnolig's patterns in a script that can be run over a large corpus, such as the DWDS (Digitales Wörterbuch der deutschen Sprache). That should give us a more complete list of words where ligatures must be suppressed.
where is that DWDS ... i can write some code to deal with it (i'd rather start from the source than from some interpretation; who knows what more there is to uncover)

additional info: we're talking of a mechanism sort of integrated in the hyphenation loop, where we can also handle compound words (if needed with details that influence how to hyphenate these)

so the above question involves:

- exceptions to exceptions
- replacements before hyphenation
- compound words (including lhmin/rhmin overloads)
- (left, right, two-sided) ligature and/or kern prevention

and whatever we like/need more (within reasonable bounds),

Hans
-----Original Message-----
From: Hans Hagen
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users; Maier, Denis Christian (UB)
Subject: Re: [NTG-context] Ligature suppression word list

where is that DWDS ... i can write some code to deal with it (i'd rather start from the source than from some interpretation; who knows what more there is to uncover)

The DWDS is here: https://www.dwds.de/ But I still need to check how we can extract the words from there...

Denis
-----Original Message-----
From: Hans Hagen
Sent: Saturday, 3 April 2021 17:58
To: mailing list for ConTeXt users; Maier, Denis Christian (UB)
Subject: Re: [NTG-context] Ligature suppression word list

where is that DWDS ... i can write some code to deal with it (i'd rather start from the source than from some interpretation; who knows what more there is to uncover)

As it turns out, the linguists who helped with the selnolig package used a different corpus: Stuttgart "Deutsch" Web as Corpus. They describe their approach in this paper: https://raw.githubusercontent.com/SHildebrandt/selnolig-check/master/selnoli...

Denis
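For the corpus route, a script could start along the following lines. This is only a sketch under assumptions: it takes selnolig rules to be written as `\nolig{search}{replacement}`, it ignores everything else in the package (such as rules that keep a ligature), and the two rules and the small word list in the usage part are made up for illustration. The function names are hypothetical as well.

```python
import re

# Matches rules of the assumed form \nolig{search}{replacement}
NOLIG_RE = re.compile(r"\\nolig\{([^{}]+)\}\{([^{}]+)\}")


def parse_rules(tex_source):
    """Return (search, replacement) pairs found in the TeX source."""
    rules = []
    for line in tex_source.splitlines():
        line = line.split("%", 1)[0]  # drop TeX comments
        rules.extend(NOLIG_RE.findall(line))
    return rules


def apply_rules(word, rules):
    """Insert '|' wherever a rule's search string occurs in the word."""
    for search, replacement in rules:
        word = word.replace(search, replacement)
    return word


def suppressed_words(corpus_words, rules):
    """Return the sorted, de-duplicated corpus words that received a '|'."""
    marked = (apply_rules(word, rules) for word in corpus_words)
    return sorted({m for m in marked if "|" in m})


# Hypothetical rules and corpus words, for illustration only
rules = parse_rules(r"""
\nolig{lfful}{lf|ful} % shelfful
\nolig{aufl}{auf|l}
""")
print(suppressed_words(["shelfful", "Kaufleute", "Haus"], rules))
# prints ['Kauf|leute', 'shelf|ful']
```

The result is already in the bar notation used in the word list, so it could be reviewed and pasted in by hand. Matching here is case-sensitive and purely literal, which is probably cruder than what the package itself does.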
On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
1. The new language options features include a tracker that allows for tracking for which words in a given document ligature prevention happened, and which words haven’t been touched by the mechanism. It should be possible to analyze the log file and to create lists of words with ligatures. Should be a rather simple step to derive new words for the ligature-suppression wordlist.
I already have some code for that but can't make you an update (garden is / will be down for some days due to maintenance).
Hans
A starting list of English non-ligatures: https://english.stackexchange.com/a/50957/22099 The entire SE thread has additional resources and is quite informative.
On 4/3/2021 6:30 PM, Thangalin wrote:
A starting list of English non-ligatures:
https://english.stackexchange.com/a/50957/22099
The entire SE thread has additional resources and is quite informative.
So can you make a file from that, like the one we made as a starting point for German?
Hans
Untested. Lists are not subject to copyright, so public domain should be legal, even though SE posts are CC-BY-SA. When a word has a single suffix or prefix (e.g., safflower/s), the two words are listed together, rather than using an explicit suffix/prefix section.

return {
    name      = "english",
    version   = "1.00",
    comment   = "English ligature suppression",
    author    = "Mico Loretan, Dave Jarvis, & Hans Hagen",
    copyright = "Public domain",
    options   = {
        {
            actions = { ["|"] = "noligature" },
            words   = [[
            ]],
        },
        {
            patterns = {
                fi = "f|i",
                fl = "f|l",
            },
            words = [[
                -- f|i
                deafish dwarfish elfish oafish selfish serfish unselfish wolfish
                -- f|l
                beefless briefless hoofless leafless roofless selfless turfless
            ]],
            suffixes = [[
                ness ly
            ]],
        },
        {
            patterns = {
                fi = "f|i",
            },
            words = [[
                proofing
            ]],
            prefixes = [[
                air- child- fire- flame- moth- rust- sound- water- weather-
            ]],
        },
        {
            patterns = {
                ff  = "f|f",
                fi  = "f|i",
                fl  = "f|l",
                ffi = "f|fi",
                ffl = "f|fl",
            },
            words = [[
                -- f|f
                bookshelfful mantelshelfful shelfful
                -- f|i
                elfin
                chafing leafing loafing sheafing strafing vouchsafing
                beefing reefing briefing debriefing
                coifing fifing jackknifing knifing midwifing waifing wifing
                goofing hoofing roofing reroofing spoofing whoofing woofing
                gulfing begulfing engulfing ingulfing
                golfing gulfing rolfing selfing wolfing
                barfing bedwarfing dwarfing enserfing kerfing scarfing snarfing
                surfing windsurfing turfing wharfing
                beefier comfier goofier gulfier leafier surfier turfier
                beefiest comfiest goofiest gulfiest leafiest surfiest turfiest
                beefily goofily goofiness
                -- f|l
                aloofly briefly chiefly deafly liefly
                calflike dwarflike elflike gulflike hooflike leaflike rooflike
                serflike sheaflike shelflike surflike turflike waiflike wolflike
                halflife shelflife halfline roofline
                leaflet leaflets leafleted leafleting leafletting leafletted leafleteer
                pdflatex
                -- f|fi
                chaffinch wolffish
                -- f|fl
                safflower safflowers
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
            },
            words = [[
                -- ff|i
                cuffing
            ]],
            prefixes = [[
                hand un
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
            },
            words = [[
                -- ff|i
                feoffing
            ]],
            prefixes = [[
                en in
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
            },
            words = [[
                -- ff|i
                staffing stuffing
            ]],
            prefixes = [[
                re over under
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
            },
            words = [[
                -- ff|i
                ruffing
            ]],
            prefixes = [[
                cross over under
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
                ffl = "ff|l",
            },
            words = [[
                -- ff|i
                draffish giraffish gruffish offish raffish sniffish standoffish
                stiffish toffish
                -- ff|l
                cuffless stuffless
            ]],
            suffixes = [[
                ly
            ]],
        },
        {
            patterns = {
                ffl = "ff|l",
            },
            words = [[
                -- ff|l
                scofflaw cufflink offline offload
            ]],
            suffixes = [[
                s ed ing
            ]],
        },
        {
            patterns = {
                ffi = "ff|i",
                ffl = "ff|l",
            },
            words = [[
                -- ff|i
                baffing biffing boffing bluffing outbluffing buffing rebuffing
                chaffing cheffing chuffing coffing coiffing daffing doffing
                fluffing gaffing gruffing huffing luffing miffing muffing offing
                piaffing puffing quaffing reffing riffing sclaffing scoffing
                scuffing shroffing sluffing sniffing snuffing spiffing stiffing
                stuffing tariffing tiffing waffing whiffing yaffing
                buffier chaffier chuffier cliffier daffier fluffier gruffier
                huffier iffier miffier puffier scruffier sniffier snuffier
                spiffier stuffier
                buffiest chaffiest chuffiest cliffiest daffiest fluffiest
                gruffiest huffiest iffiest miffiest puffiest scruffiest
                sniffiest snuffiest spiffiest stuffiest
                daffily fluffily gruffily huffily puffily scruffily sniffily
                snuffily spiffily stuffily
                fluffiness huffiness iffiness puffiness scruffiness sniffiness
                spiffiness stuffiness
                baffies biffies jiffies taffies toffies waffie
                Pfaffian Wolffian Wulffian
                -- ff|l
                bluffly gruffly ruffly snuffly stiffly rufflike clifflike
            ]],
        },
        {
            patterns = {
                ft  = "f|t",
                fft = "ff|t",
            },
            words = [[
                -- f|t
                chieftain chieftains chieftaincy chieftainship
                fifteen fifteens fifteenth fifteenths
                fifth fifthly fifths fifties fiftieth fiftieths fifty fiftyish
                halftime halftone rooftop rooftops rooftree
                -- ff|t
                offtrack
            ]]
        }
    }
}
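To make the intent of the prefixes, suffixes, and patterns keys concrete, here is a small Python sketch of how one entry could expand into marked words. It is an interpretation, not a description of what ConTeXt does with the file: it assumes that every prefix and every suffix is attached to every base word (but not both at once), and that every occurrence of a pattern key is replaced, longest key first. All function names are made up for this example.

```python
import re


def expand(words, prefixes=(), suffixes=()):
    """Yield each base word, then its prefixed and its suffixed forms."""
    for word in words:
        yield word
        for prefix in prefixes:
            yield prefix + word
        for suffix in suffixes:
            yield word + suffix


def mark(word, patterns):
    """Replace every pattern key in one pass, trying longer keys first."""
    keys = sorted(patterns, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(key) for key in keys))
    return regex.sub(lambda m: patterns[m.group(0)], word)


def marked_forms(entry):
    """Expand one entry (a dict mirroring the Lua table) into marked words."""
    forms = expand(entry["words"],
                   entry.get("prefixes", ()),
                   entry.get("suffixes", ()))
    return [mark(form, entry["patterns"]) for form in forms]


entry = {
    "patterns": {"ffi": "ff|i"},
    "words": ["cuffing"],
    "prefixes": ["hand", "un"],
}
print(marked_forms(entry))
# prints ['cuff|ing', 'handcuff|ing', 'uncuff|ing']
```

Trying the longer keys first matters for the block that mixes ff, fi, fl, ffi, and ffl: with that order, chaffinch becomes chaf|finch in a single pass instead of being split twice.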
On 4/3/2021 5:06 PM, denis.maier@ub.unibe.ch wrote:
For those interested, that file only has ligature prevention definitions.

    {
        actions = { ["|"] = "noligature" },
        words   = [[
            Auf|lagefläche
            Auf|lageflächen
            Auf|lagenziffer
            Auf|lagenziffern
        ]],
    },

can be (lig prevention already in words):

    {
        words = [[
            Auf|lagefläche
            Auf|lageflächen
            Auf|lagenziffer
            Auf|lagenziffern
        ]],
    },

or the more efficient (first match only):

    {
        actions = { ["|"] = "noligature" },
        matches = { 1 },
        words   = [[
            Auflagefläche
            Auflageflächen
            Auflagenziffer
            Auflagenziffern
        ]],
    },

or if you want all matches:

    {
        actions = { ["|"] = "noligature" },
        words   = [[
            Auflagefläche
            Auflageflächen
            Auflagenziffer
            Auflagenziffern
        ]],
    },

or when you want no kerns either (of course one can also use the patterns key):

    {
        actions = { ["|"] = "noligature nokern" },
        words   = [[
            ef|fe
        ]],
    },

btw, users will also be able to do this in a document source:

    \startlanguageoptions[de]
        Zapf|innovation
        whatever+innovation
    \stoplanguageoptions

that gives ligature prevention in the first word and a compound word in the next one.

so, one way to see what we need is if users try to analyze their 'exceptions' (if they have them defined at all), so that we can spot possible tricks needed (i might actually combine this with exceptions that normally come after this stage)

Hans
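The difference between "first match only" and "all matches" is easiest to see on Auflagefläche, which contains fl twice. The Python sketch below only illustrates that difference; it is not how ConTeXt implements the matches key. It assumes a pattern fl = "f|l" (the snippets above do not show one) and reads the matches list as 1-based occurrence numbers. The function name `suppress` is made up.

```python
import re


def suppress(word, cluster, replacement, matches=None):
    """Replace occurrences of cluster in word.

    matches=None replaces every occurrence; otherwise only the
    occurrences whose 1-based index is in matches are replaced.
    """
    seen = 0

    def replace(match):
        nonlocal seen
        seen += 1
        if matches is None or seen in matches:
            return replacement
        return match.group(0)

    return re.sub(re.escape(cluster), replace, word)


print(suppress("Auflagefläche", "fl", "f|l", matches={1}))
# prints Auf|lagefläche
print(suppress("Auflagefläche", "fl", "f|l"))
# prints Auf|lagef|läche
```

Only the first result is wanted here: the fl in Auf|lage sits on a word boundary, while the fl in fläche should stay a ligature. That is the case the explicit bar or a first-match restriction is meant to cover.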
participants (4)
- Arthur Rosendahl
- denis.maier@ub.unibe.ch
- Hans Hagen
- Thangalin