Denis’ latest question reminded me of an earlier query of his about hyphenation, asking why “applicable” and “obligated” were hyphenated by ConTeXt as ap-plic-a-ble and ob-lig-at-ed, and not ap-pli-ca-ble and ob-li-ga-te(d) as in Merriam-Webster (the discussion started at https://mailman.ntg.nl/pipermail/ntg-context/2020/099695.html).

First of all, I note that while Webster’s dictionary is a useful guide, and indeed a major reference for any American typographer, there’s no absolute rule that we have to follow it. The break applic-able, for example, does look acceptable to me; oblig-ated, less so.

Taco reminded us that when producing a set of hyphenation patterns from a list of hyphenated words, we’re essentially compressing information, and that some minor deviations are to be expected. However, in my experience, unexpected breakpoints are almost never due to chance; they usually result from a deliberate decision. Then Hraban said:

On Fri, Oct 09, 2020 at 10:15:17AM +0200, Henning Hraban Ramm wrote:
Usually Arthur’s (hail the emperor of hyphenation and protector of the patterns) patterns are flawless, so I guess it’s not a bug but an exception of the rules.
I see that my self-appointed title is catching on, nice :-)

Unfortunately the patterns are just as likely to contain errors as anything else, and in this particular case we’ll probably never know for sure, because the original hyphenated word list was never published (all the word lists from which patterns were produced in the 80s and 90s have been lost, for all languages). We’re thus reduced to guessing the intent of those who compiled the lists.

We can get hints by looking at the patterns involved in the debatable breaks. Hans has a useful script:

$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate applicable
hyphenator      |
hyphenator      | . a p p l i c a b l e .     . a p p l i c a b l e .
hyphenator      | 4p1p0                        0 4 1 0 0 0 0 0 0 0 0
hyphenator      | 1p2l2                        0 4 1 2 2 0 0 0 0 0 0
hyphenator      | 0p0l0i2c1a0b0                0 4 1 2 2 2 1 0 0 0 0
hyphenator      | 1c0a0                        0 4 1 2 2 2 1 0 0 0 0
hyphenator      | 0c0a1b0l0                    0 4 1 2 2 2 1 1 0 0 0
hyphenator      | 0b2l2                        0 4 1 2 2 2 1 1 2 2 0
hyphenator      | 0b4l0e0.0                    0 4 1 2 2 2 1 1 4 2 0
hyphenator      | .0a4p1p2l2i2c1a1b4l2e0.     . a p-p l i c-a-b l e .
hyphenator      |
mtx-patterns    | us 2 2 : applicable : ap-plic-a-ble

That tells us that seven patterns are involved in hyphenating the word applicable: 4p1p, 1p2l2, pli2c1ab, 1ca, ca1bl, b2l2, and b4le. (the final dot is part of that last pattern). The pattern responsible for the break applic-able is pli2c1ab.

If we now refer to the source repository for hyphenation patterns (since comments are stripped in the ConTeXt sources):

https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/...

we can see at line 4508 the comment “hyphen.tex patterns end here, and additional patterns begin”, which means that the pattern pli2c1ab, at line 4817, is an “additional pattern”.
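To make the trace concrete, here is a minimal Python sketch of how such patterns combine. It is not Hans’ script nor TeX’s actual implementation: the function names (parse_pattern, hyphenate) are made up for illustration, and the pattern list holds only the seven patterns that fire for this word, written as they appear in the trace (so the first one is 4p1p), rather than the full us pattern set. The idea is that every gap between letters takes the highest number any matching pattern assigns to it, and a break is allowed where that number is odd, subject to the --left and --right minimums.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'pli2c1ab' into its letters and gap weights.

    weights[k] is the number sitting in the gap before letters[k];
    the last entry is the gap after the final letter.
    """
    letters = ""
    weights = [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters += ch
            weights.append(0)
    return letters, weights


def hyphenate(word, patterns, left=2, right=2):
    """Apply the given patterns to one word and return it with '-' at breaks."""
    text = "." + word + "."            # dots mark the word boundaries
    levels = [0] * (len(text) + 1)     # levels[i] is the gap before text[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # The highest number seen for a gap wins.
                levels[start + offset] = max(levels[start + offset], weight)
            start = text.find(letters, start + 1)
    pieces = []
    for i, ch in enumerate(word):
        # The gap before word[i] is the gap before text[i + 1].
        allowed = left <= i <= len(word) - right
        if allowed and levels[i + 1] % 2 == 1:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)


# Only the patterns listed in the trace for "applicable" (an assumption:
# the real pattern file contains thousands more, none of which match here).
APPLICABLE_PATTERNS = ["4p1p", "1p2l2", "pli2c1ab", "1ca", "ca1bl", "b2l2", "b4le."]

print(hyphenate("applicable", APPLICABLE_PATTERNS))  # ap-plic-a-ble
```

Removing pli2c1ab from the list makes the break after “applic” disappear, which is one way to see that this single pattern is responsible for it.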
The background story is that hyphen.tex, the original hyphenation pattern file for American English, produced in 1982-1983 from a list of hyphenated words (following mostly Webster’s), was later augmented with more patterns that were supposed to improve hyphenation for many words. The person who added these new patterns apparently had a list of words hyphenated incorrectly (according to him) by hyphen.tex, but both that list and the one used to produce hyphen.tex are, as mentioned above, now lost, probably forever.

In any case, the pattern that causes the break applic-able was clearly added intentionally; and as I said, that break seems quite reasonable to me. Not so for the one in oblig-ated, so let’s have a look at that:

$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate obligated
hyphenator      |
hyphenator      | . o b l i g a t e d .     . o b l i g a t e d .
hyphenator      | 0o0b0l0i2g1                0 0 0 0 2 1 0 0 0 0
hyphenator      | 0b2l2                      0 0 2 2 2 1 0 0 0 0
hyphenator      | 5l0i0g0a0t0e0              0 0 5 2 2 1 0 0 0 0
hyphenator      | 2i0g0                      0 0 5 2 2 1 0 0 0 0
hyphenator      | 1g0a0                      0 0 5 2 2 1 0 0 0 0
hyphenator      | 2t1e0d0                    0 0 5 2 2 1 2 1 0 0
hyphenator      | .0o0b5l2i2g1a2t1e0d0.     . o b-l i g-a t-e d .
hyphenator      |
mtx-patterns    | us 2 2 : obligated : ob-lig-at-ed

Here we see that the dubious break is caused by the pattern obli2g1, also an “additional pattern” (line 4783), and here it’s not hard to guess where it comes from: it has to be for the word obligatory, hyphenated regularly as o-blig-a-to-ry according to M-W -- and myself ;-) The incorrect breakpoint in oblig-ated is an undesired side effect of that.

Best,
ArthuR