[NTG-context] Hyphenation patterns

Arthur Rosendahl arthur.reutenauer at normalesup.org
Fri Apr 9 23:57:56 CEST 2021


  Denis’ latest question reminded me of an earlier query he had about
hyphenation, asking why “applicable” and “obligated” were hyphenated by
ConTeXt as ap-plic-a-ble and ob-lig-at-ed, and not ap-pli-ca-ble and
ob-li-ga-te(d) like in Merriam-Webster (the discussion started at
https://mailman.ntg.nl/pipermail/ntg-context/2020/099695.html).

  First of all, I note that while Webster’s dictionary is a useful
guide, and indeed a major reference for any American typographer,
there’s no absolute rule that we have to follow it.  The break
applic-able, for example, does look acceptable to me; oblig-ated, less
so.

  Taco reminded us that when producing a set of hyphenation patterns from a
list of hyphenated words, we’re essentially compressing information, and
that some minor deviations are to be expected.  However, in my
experience, unexpected breakpoints are almost never due to chance, but
to a deliberate decision.

  Then Hraban said that:

On Fri, Oct 09, 2020 at 10:15:17AM +0200, Henning Hraban Ramm wrote:
> Usually Arthur’s (hail the emperor of hyphenation and protector of the patterns) patterns are flawless, so I guess it’s not a bug but an exception of the rules.

  I see that my self-appointed title is catching on, nice :-)
Unfortunately the patterns are just as likely to contain errors as
anything else, and in this particular case we’ll probably never know for
sure, because the original hyphenated word list was never published (all
the word lists from which patterns were produced in the 80s and 90s have
been lost, for all languages).  We’re thus reduced to guessing the
intent of those who compiled the lists.

  We can get hints from looking at the patterns involved in the
debatable breaks.  Hans has a useful script:

	$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate applicable
	hyphenator      |
	hyphenator      | . a p p l i c a b l e .   . a p p l i c a b l e .  
	hyphenator      |    4p1p0                   0 4 1 0 0 0 0 0 0 0 0  
	hyphenator      |      1p2l2                 0 4 1 2 2 0 0 0 0 0 0  
	hyphenator      |      0p0l0i2c1a0b0         0 4 1 2 2 2 1 0 0 0 0  
	hyphenator      |            1c0a0           0 4 1 2 2 2 1 0 0 0 0  
	hyphenator      |            0c0a1b0l0       0 4 1 2 2 2 1 1 0 0 0  
	hyphenator      |                0b2l2       0 4 1 2 2 2 1 1 2 2 0  
	hyphenator      |                0b4l0e0.0   0 4 1 2 2 2 1 1 4 2 0  
	hyphenator      | .0a4p1p2l2i2c1a1b4l2e0.   . a p-p l i c-a-b l e .  
	hyphenator      |
	mtx-patterns    | us 2 2 : applicable : ap-plic-a-ble

  That tells us that there are seven patterns involved in hyphenating
the word applicable: 4p1p, 1p2l2, pli2c1ab, 1ca, ca1bl, b2l2, and b4le.
(the final dot is part of that last pattern).  The pattern responsible
for the break applic-able is pli2c1ab.  If we now refer to the source
repository for hyphenation patterns (since comments are stripped in the
ConTeXt sources): https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-en-us.tex
-- we can see on line 4508

	hyphen.tex patterns end here, and additional patterns begin:

which means that the pattern pli2c1ab, line 4817, is an “additional
pattern”.  The background story is that hyphen.tex, the original
hyphenation pattern file for American English, produced in 1982-1983
from a list of hyphenated words (following mostly Webster’s), was later
augmented with more patterns that were supposed to improve hyphenation
for many words.  The person who added these new patterns apparently had
a list of words hyphenated incorrectly (according to him) by hyphen.tex,
but both that list and the one used to produce hyphen.tex are, as
mentioned above, now lost, probably forever.

  In any case, the pattern that causes the break applic-able was clearly
added intentionally; and as I said that break seems quite reasonable to
me.  Not so for the one in oblig-ated, so let’s have a look at that:

	$ mtxrun --script patterns --language=us --left=2 --right=2 --hyphenate obligated
	hyphenator      |
	hyphenator      | . o b l i g a t e d .   . o b l i g a t e d .  
	hyphenator      |  0o0b0l0i2g1             0 0 0 0 2 1 0 0 0 0  
	hyphenator      |    0b2l2                 0 0 2 2 2 1 0 0 0 0  
	hyphenator      |      5l0i0g0a0t0e0       0 0 5 2 2 1 0 0 0 0  
	hyphenator      |        2i0g0             0 0 5 2 2 1 0 0 0 0  
	hyphenator      |          1g0a0           0 0 5 2 2 1 0 0 0 0  
	hyphenator      |              2t1e0d0     0 0 5 2 2 1 2 1 0 0  
	hyphenator      | .0o0b5l2i2g1a2t1e0d0.   . o b-l i g-a t-e d .  
	hyphenator      |
	mtx-patterns    | us 2 2 : obligated : ob-lig-at-ed

  Here we see that the dubious break is caused by the pattern obli2g1,
also an “additional pattern” (line 4783), and here it’s not hard to
guess where it comes from: it has to be for the word obligatory,
hyphenated regularly as o-blig-a-to-ry according to M-W -- and myself ;-)
The incorrect breakpoint in oblig-ated is an undesired side effect of
that.
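
  For anyone who wants to see how the values in the two traces above
combine, here is a minimal sketch in Python.  It is not the code that
mtxrun or TeX actually runs, and the names (PATTERNS, parse, hyphenate)
are made up for the illustration.  It only knows the patterns read off
the two traces, so it is meaningless for any other word.  The rule it
implements is the usual one: at each position between two letters, keep
the highest value any matching pattern supplies, and allow a break where
that value is odd.

```python
# Patterns read off the two traces above.  The real hyph-en-us.tex has
# thousands more, so this subset only works for these two words.
PATTERNS = [
    "4p1p", "1p2l2", "pli2c1ab", "1ca", "ca1bl", "b2l2", "b4le.",
    "obli2g1", "5ligate", "2ig", "1ga", "2t1ed",
]


def parse(pattern):
    """Return the pattern's letters and the value before/after each letter."""
    letters, values = "", [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)      # digit applies to the current gap
        else:
            letters += ch             # letter (or dot) opens a new gap
            values.append(0)
    return letters, values


def hyphenate(word, patterns=PATTERNS, left=2, right=2):
    """Hyphenate word using only the given patterns (illustrative subset)."""
    dotted = "." + word + "."
    # levels[i] is the value of the gap just before dotted[i]
    levels = [0] * (len(dotted) + 1)
    for pattern in patterns:
        letters, values = parse(pattern)
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                # the highest value seen at a gap wins
                levels[start + offset] = max(levels[start + offset], value)
            start = dotted.find(letters, start + 1)
    pieces, last = [], 0
    # keep at least `left` letters before and `right` letters after a break
    for i in range(left, len(word) - right + 1):
        if levels[i + 1] % 2 == 1:    # odd value: break allowed before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("applicable"))
print(hyphenate("obligated"))
```

  The left and right arguments play the role of --left=2 --right=2 in
the mtxrun calls.  Removing pli2c1ab or obli2g1 from the list is a quick
way to see what those two additional patterns contribute, keeping in
mind that the full pattern set may well behave differently.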

	Best,

		ArthuR

