Re: [Dev-luatex] generalized hyphenation

7 Mar 2008

David Kastrup wrote:
...
Taco Hoekwater writes:
...
David Kastrup wrote:
...
Here at the DANTE conference I just learnt that Werner Lemberg is
creating a large corpus of two separate "all hyphenations" and "main
hyphenations" lists (about 400000 words IIRC) for German.  So indeed
it would appear that if LuaTeX offered hyphenation according to
prioritized patterns, the data to make it typeset better documents in
German would be reasonably well available.
If there are two 'hyphenation levels', wouldn't it be easier if luatex
supported running through two (or even more) separate pattern sets,
and added the 'hitcount' to the discretionary?
Easier on what account?
Extending discretionary nodes is easier for me than extending
the internal pattern data structure; I could program the whole
multiple-pass approach in only a few days.

It is also easier in the sense that it can use existing patterns:
there is no need to mess with patgen output, and no need for extensive
testing of the postprocessed output. But if you can create these
extended patterns, I'll wait for that.
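
To make the multiple-pass idea concrete, here is a minimal sketch in
Python (not LuaTeX code). It runs a simplified Liang-style matcher once
per pattern set and counts, for each break position, how many sets allow
it. The function names, the toy patterns and the sample word are all
invented for illustration; real patterns would come from patgen.

```python
import re


def parse_pattern(pattern):
    """Split a Liang pattern such as 'n1d' into its letters and digit values."""
    letters = re.sub(r"\d", "", pattern)
    # values[k] is the digit standing before letters[k]; the last entry
    # is the digit (if any) after the final letter.
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values


def hyphen_points(word, patterns, left_min=2, right_min=2):
    """Return the set of i where a break between word[i-1] and word[i] is allowed."""
    padded = "." + word.lower() + "."
    levels = [0] * (len(padded) + 1)
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            # Keep the highest value seen at each inter-letter position.
            for offset, value in enumerate(values):
                levels[start + offset] = max(levels[start + offset], value)
            start = padded.find(letters, start + 1)
    # Odd values allow a break, even values forbid it.
    return {
        i
        for i in range(left_min, len(word) - right_min + 1)
        if levels[i + 1] % 2 == 1
    }


def hit_counts(word, pattern_sets):
    """Run one pass per pattern set and count the passes that allow each break."""
    counts = {}
    for patterns in pattern_sets:
        for pos in hyphen_points(word, patterns):
            counts[pos] = counts.get(pos, 0) + 1
    return counts


# Toy pattern sets (assumptions, not real German patterns).
ALL_PATTERNS = ["n1d", "s1r"]   # "all hyphenations": bun-des-rat
MAIN_PATTERNS = ["s1r"]         # "main hyphenations": bundes-rat

print(hit_counts("bundesrat", [ALL_PATTERNS, MAIN_PATTERNS]))
```

In this sketch the break in "bundes-rat" gets a hitcount of 2 and the
one in "bun-desrat" a hitcount of 1; that count is the value that would
be stored in the discretionary.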
...
...
Disadvantage: wastes a few CPU cycles because of multiple passes.
Well, hyphenation is not the fastest operation in the world.
It maxes out at about 10% of runtime in a plain TeX Latin-text document
with all bells and whistles like protruding and hz turned off, and
a sane text width (about two-thirds of plain's default), while
generating DVI.

It is usually lower in other formats because of more work done by
macros, or special features like hz, or PDF output, or RL text,
or math, or use of OpenType fonts. Hyphenation time is not negligible,
and everything that slows the engine down warrants some discussion.

But e.g. the ConTeXt MkIV code spends less than 1% of its runtime
hyphenating.  So we are not talking landslides either.
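
To put those percentages in perspective, here is a back-of-the-envelope
calculation in Python. It assumes the pessimistic case in which a second
pass makes hyphenation exactly twice as slow and leaves everything else
unchanged. The function name is made up for this sketch, and 1% is used
as an upper bound for the ConTeXt case.

```python
def total_slowdown(hyphenation_share, hyphenation_factor=2.0):
    """Overall runtime factor when only the hyphenation share gets slower."""
    return (1.0 - hyphenation_share) + hyphenation_share * hyphenation_factor


for label, share in [("plain TeX, worst case", 0.10), ("ConTeXt MkIV", 0.01)]:
    print(f"{label}: total runtime x{total_slowdown(share):.2f}")
```

So even a full doubling of hyphenation time would cost at most about
10% of total runtime in the plain TeX case and about 1% in ConTeXt.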
...
Doubling
its runtime when one could instead add what amounts to an attribute to
the final chosen point seems a bit pointless.
It will be a bit slower, but I doubt runtime will actually double.

Some reasons why it will not be twice as bad:
* the extended-pattern approach will slow down hyphenation a bit for
   languages that do not have these extended patterns
* enlarging the pattern object data also has a speed penalty
* there is more (programming as well as runtime) work needed to get
   the 'right' penalty than in the multiple-pass case
* discretionary nodes have to be enlarged because in this case you
   have to store actual penalties instead of hitcounts, otherwise there
   can be external changes to the penalty values

It all depends on how hard it is to create these special patterns.
Can you do that easily, or would it be a lot of work?

Best wishes,
Taco