Hi! i noticed the extension to the allowed syntax inside \hyphenation which made it possible to use arbitrary discretionaries in the \hyphenation list (using the {pre-}{post}{normal} syntax) instead of just the most common {-}{}{} discretionary. some languages need a lot of discretionaries for proper hyphenation, and being able to specify whole-word exceptions with arbitrary discretionaries is already a good step forward, but i wonder if you plan to extend the \patterns mechanism to support arbitrary discretionaries too? Best, v.
Vladimir Volovich wrote:
some languages need a lot of discretionaries for proper hyphenation, and being able to specify whole-word exceptions with arbitrary discretionaries is already a good step forward, but i wonder if you plan to extend the \patterns mechanism to support arbitrary discretionaries too?
For sure. Feel free to propose a syntax ;-) Best wishes, Taco
Taco Hoekwater
Vladimir Volovich wrote:
some languages need a lot of discretionaries for proper hyphenation, and being able to specify whole-word exceptions with arbitrary discretionaries is already a good step forward, but i wonder if you plan to extend the \patterns mechanism to support arbitrary discretionaries too?
For sure. Feel free to propose a syntax ;-)
Of course, while we are at it, we want \patterns{a23b} to have the normal meaning of "2", but delivering a penalty corresponding to 70 when we have specified \hyphenpenalties 5 10 30 40 70 20 (similar to the widowpenalties multiplicity). That allows specifying breakpoints with various priorities (good and bad ones). Another wishlist item would be to be able to use the pattern mechanism for picking between long and short s forms (won't help in the case of "Wachstube" which can have either depending on meaning, but should work for most other cases). -- David Kastrup, Kriemhildstr. 15, 44793 Bochum
Another wishlist item would be to be able to use the pattern mechanism for picking between long and short s forms (won't help in the case of "Wachstube" which can have either depending on meaning, but should work for most other cases).
There's an interesting related issue: I've always wondered if it was possible to adapt the pattern mechanism to handle the cases where you don't wish to use a ligature in German (like in “auffallen”). I suppose it would need a different dictionary (you probably can't use the normal hyphenation patterns – that's my guess only), but it surely would be legitimate to instruct TeX where not to set ligatures in some languages. Then we can devise a similar dictionary to automatically correct the “oe” into “œ” in French :-) Arthur
Arthur Reutenauer wrote:
Another wishlist item would be to be able to use the pattern mechanism for picking between long and short s forms (won't help in the case of "Wachstube" which can have either depending on meaning, but should work for most other cases).
There's an interesting related issue: I've always wondered if it was possible to adapt the pattern mechanism to handle the cases where you don't wish to use a ligature in German (like in “auffallen”). I suppose it would need a different dictionary (you probably can't use the normal hyphenation patterns – that's my guess only), but it surely would be legitimate to instruct TeX where not to set ligatures in some languages.
Then we can devise a similar dictionary to automatically correct the “oe” into “œ” in French :-)
Certainly not the current \patterns mechanism itself, because that deals with inserting valid hyphenation points. Hyphenation in luatex is not related to fonts and ligatures. You could use <patterns> as created by patgen, but they would have to use a new primitive. If you want to experiment with something like this, write lua code that does it, and hook it into the "ligkern" callback. Parsing pure patterns (without TeX catcodes) is very simple indeed. Best wishes, Taco
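To illustrate the remark that parsing pure patterns is simple, here is a minimal sketch in Python (not LuaTeX code, and not hooked into any callback). The names parse_pattern and break_points and the toy patterns are assumptions made for this example. The logic follows the usual Liang scheme as used by patgen-style patterns: the highest digit seen at each inter-letter position wins, and an odd value permits a break.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter values.

    values[i] is the digit in front of letters[i]; the last entry is the
    digit after the final letter. Missing digits count as 0.
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def break_points(word, patterns, left_min=2, right_min=3):
    """Return the positions k where a break is allowed before word[k]."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i]: value in front of padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = parsed.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    pos = start + offset
                    scores[pos] = max(scores[pos], value)
    # A break before word[k] corresponds to the gap in front of padded[k + 1].
    return [k for k in range(left_min, len(word) - right_min + 1)
            if scores[k + 1] % 2 == 1]


if __name__ == "__main__":
    toy = ["hy3ph", "he2n", "1n"]          # toy patterns, not from a real pattern file
    print(break_points("hyphen", toy))     # positions usable as hy-phen
```

The same table of per-position values could in principle drive decisions other than hyphenation (for example, where to suppress a ligature), which is why a separate pattern set and a separate primitive or callback would be needed for that purpose.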
David Kastrup wrote:
Taco Hoekwater writes:
Vladimir Volovich wrote:
some languages need a lot of discretionaries for proper hyphenation, and being able to specify whole-word exceptions with arbitrary discretionaries is already a good step forward, but i wonder if you plan to extend the \patterns mechanism to support arbitrary discretionaries too? For sure. Feel free to propose a syntax ;-)
Of course, while we are at it, we want \patterns{a23b} to have the normal meaning of "2", but delivering a penalty corresponding to 70 when we have specified \hyphenpenalties 5 10 30 40 70 20 (similar to the widowpenalties multiplicity). That allows specifying breakpoints with various priorities (good and bad ones).
I am having a hard time parsing this. How does the "3" relate to the value "70"? Also, how do you propose to create such pattern files? Best wishes, Taco
Taco Hoekwater
David Kastrup wrote:
Taco Hoekwater writes:
Vladimir Volovich wrote:
some languages need a lot of discretionaries for proper hyphenation, and being able to specify whole-word exceptions with arbitrary discretionaries is already a good step forward, but i wonder if you plan to extend the \patterns mechanism to support arbitrary discretionaries too? For sure. Feel free to propose a syntax ;-)
Of course, while we are at it, we want \patterns{a23b} to have the normal meaning of "2", but delivering a penalty corresponding to 70 when we have specified \hyphenpenalties 5 10 30 40 70 20 (similar to the widowpenalties multiplicity). That allows specifying breakpoints with various priorities (good and bad ones).
I am having a hard time parsing this. How does the "3" relate to the value "70"? Also, how do you propose to create such pattern files?
70 is the fourth value (index 3) in the list of 5 penalty values. As to generating the pattern files: this depends on a hyphenation list with prioritized breakpoints (the printed Duden lexicon shows such breakpoints, good and emergency ones, so I presume that there might be some database somewhere). I don't know of any money or urgency sitting around for a feature like that, though. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum
70 is the fourth value (index 3) in the list of 5 penalty values.
Actually no, it's the fifth of 6 values, but I figured it was the way it should be read :-)
this depends on a hyphenation list with prioritized breakpoints (the printed Duden lexicon shows such breakpoints, good and emergency ones, so I presume that there might be some database somewhere).
You're lucky, then; I'm not aware of any such list for French (and I doubt it would be possible for specialists to agree on a single list, but that's another problem). Arthur
Arthur Reutenauer
70 is the fourth value (index 3) in the list of 5 penalty values.
Actually no, it's the fifth of 6 values, but I figured it was the way it should be read :-)
No, 5 is the length of the list, not a list member. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum
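The reading David describes can be made concrete with a small Python sketch. Note that \hyphenpenalties is only a proposed primitive in this thread, not an existing one; the function names below and the zero-based lookup are assumptions that simply follow his explanation (the first number is the length of the list, and pattern level 3 selects the entry at index 3).

```python
def parse_hyphenpenalties(spec):
    """Parse a '<count> <p0> <p1> ...' specification into a list of penalties.

    The leading number is the length of the list, as with \\widowpenalties,
    and is not itself a penalty.
    """
    numbers = [int(token) for token in spec.split()]
    count, values = numbers[0], numbers[1:]
    if count != len(values):
        raise ValueError(f"expected {count} penalties, got {len(values)}")
    return values


def penalty_for_level(values, level):
    """Look up the penalty for a pattern level, counting from zero."""
    return values[level]


if __name__ == "__main__":
    penalties = parse_hyphenpenalties("5 10 30 40 70 20")
    print(penalties)                        # the five penalty values
    print(penalty_for_level(penalties, 3))  # penalty selected by level 3
```

Under this reading the list holds the five values 10, 30, 40, 70 and 20, and level 3 selects 70.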
Arthur Reutenauer
70 is the fourth value (index 3) in the list of 5 penalty values.
Actually no, it's the fifth of 6 values, but I figured it was the way it should be read :-)
this depends on a hyphenation list with prioritized breakpoints (the printed Duden lexicon shows such breakpoints, good and emergency ones, so I presume that there might be some database somewhere).
You're lucky, then; I'm not aware of any such list for French (and I doubt it would be possible for specialists to agree on a single list, but that's another problem).
Here at the DANTE conference I just learnt that Werner Lemberg is creating a large corpus of two separate "all hyphenations" and "main hyphenations" lists (about 400000 words IIRC) for German. So indeed it would appear that if LuaTeX offered hyphenation according to prioritized patterns, the data to make it typeset better documents in German would be reasonably well available. Werner, comments on that? -- David Kastrup, Kriemhildstr. 15, 44793 Bochum
David Kastrup wrote:
Arthur Reutenauer writes:
70 is the fourth value (index 3) in the list of 5 penalty values. Actually no, it's the fifth of 6 values, but I figured it was the way it should be read :-)
this depends on a hyphenation list with prioritized breakpoints (the printed Duden lexicon shows such breakpoints, good and emergency ones, so I presume that there might be some database somewhere).
You're lucky, then; I'm not aware of any such list for French (and I doubt it would be possible for specialists to agree on a single list, but that's another problem).
Here at the DANTE conference I just learnt that Werner Lemberg is creating a large corpus of two separate "all hyphenations" and "main hyphenations" lists (about 400000 words IIRC) for German. So indeed it would appear that if LuaTeX offered hyphenation according to prioritized patterns, the data to make it typeset better documents in German would be reasonably well available.
If there are two 'hyphenation levels', wouldn't it be easier if luatex supported running through two (or even more) separate pattern sets, and added the 'hitcount' to the discretionary? So breakpoints that appear in both sets of patterns would get an internal priority value of 2 instead of 1? Main advantage: no need for a patched or postprocessed patgen. Disadvantage: wastes a few CPU cycles because of multiple passes.
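The multiple-pass idea can be sketched in a few lines of Python. This is only an illustration of the counting step, not LuaTeX's implementation: the names hit_counts and at_least are hypothetical, and the break positions for the sample word are written by hand rather than taken from any real pattern set or word list.

```python
from collections import Counter


def hit_counts(break_sets):
    """Count, for each break position, how many pattern sets allow it."""
    counts = Counter()
    for breaks in break_sets:
        counts.update(set(breaks))   # each pattern set counts at most once
    return dict(sorted(counts.items()))


def at_least(counts, threshold):
    """Keep only the positions allowed by at least `threshold` pattern sets."""
    return [pos for pos, hits in counts.items() if hits >= threshold]


if __name__ == "__main__":
    # Hand-written example for "Urinstinkt": an "all hyphenations" pass
    # gives Ur-in-stinkt, a "main hyphenations" pass gives only Ur-instinkt.
    all_breaks = [2, 4]
    main_breaks = [2]
    counts = hit_counts([all_breaks, main_breaks])
    print(counts)                 # position -> number of passes that allow it
    print(at_least(counts, 2))    # positions with the higher priority
```

A position found by both passes ends up with a hitcount of 2 and one found only by the "all hyphenations" pass with 1, which is the internal priority value described above.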
Taco Hoekwater
David Kastrup wrote:
Here at the DANTE conference I just learnt that Werner Lemberg is creating a large corpus of two separate "all hyphenations" and "main hyphenations" lists (about 400000 words IIRC) for German. So indeed it would appear that if LuaTeX offered hyphenation according to prioritized patterns, the data to make it typeset better documents in German would be reasonably well available.
If there are two 'hyphenation levels', wouldn't it be easier if luatex supported running through two (or even more) separate pattern sets, and added the 'hitcount' to the discretionary?
Easier on what account?
So breakpoints that appear in both sets of patterns would get an internal priority value of 2 instead of 1?
Main advantage: no need for a patched or postprocessed patgen.
Postprocessing is an obvious choice here.
Disadvantage: wastes a few CPU cycles because of multiple passes.
Well, hyphenation is not the fastest operation in the world. Doubling its runtime when one could instead add what amounts to an attribute to the final chosen point seems a bit pointless. On the other hand, running several pattern sets through, adding up the valid points and making a decision based on that would allow one to, say, choose a hyphen when it would look good in either English or German, or choose it when it's ok in 4 out of 5 selected European languages. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum
David Kastrup wrote:
Taco Hoekwater writes:
David Kastrup wrote:
Here at the DANTE conference I just learnt that Werner Lemberg is creating a large corpus of two separate "all hyphenations" and "main hyphenations" lists (about 400000 words IIRC) for German. So indeed it would appear that if LuaTeX offered hyphenation according to prioritized patterns, the data to make it typeset better documents in German would be reasonably well available. If there are two 'hyphenation levels', wouldn't it be easier if luatex supported running through two (or even more) separate pattern sets, and added the 'hitcount' to the discretionary?
Easier on what account?
Extending discretionary nodes is easier for me than extending the internal pattern data structure; I could program the whole multiple pass approach in only a few days. It is also easier in the sense that it can use existing patterns, with no need to mess with patgen output and no need for extensive testing of the postprocessed output. But if you can create these extended patterns, I'll wait for that.
Disadvantage: wastes a few CPU cycles because of multiple passes.
Well, hyphenation is not the fastest operation in the world.
It maxes out at about 10% runtime in a plain tex latin text document with all bells and whistles like protruding and hz turned off, a sane text with (about two-thirds of plain's default) while generating DVI. It is usually lower in other formats because of more work done by macros, or special features like HZ, or PDF output, or RL text, or math, or use of OpenType fonts. Hyphenation time is not negligible, and everything that slows the engine down warrants some discussion. But e.g. the ConTeXt mkiv code spends less than 1% of its runtime hyphenating. So we are not talking landslides either.
Doubling its runtime when one could instead add what amounts to an attribute to the final chosen point seems a bit pointless.
It will be a bit slower, but I doubt runtime will actually double. Some reasons why it will not be twice as bad:
* this approach will slow down hyphenation a bit for languages that do not have these extended patterns
* enlarging the pattern object data has a speed penalty also
* there is more (programming as well as runtime) work needed to get the 'right' penalty than in the multiple pass case
* discretionary nodes have to be enlarged because in this case you have to store actual penalties instead of hitcounts, otherwise there can be external changes to the penalty values
It all depends on how hard it is to create these special patterns. Can you do that easily, or would it be a lot of work? Best wishes, Taco
[Resent. Sorry for duplicates.]
If there are two 'hyphenation levels',
For German, two hyphenation levels are not sufficient. Three should be fine, however.
wouldn't it be easier if luatex supported running through two (or even more) separate pattern sets, and added the 'hitcount' to the discretionary? So breakpoints that appear in both sets of patterns would get an internal priority value of 2 instead of 1?
Sounds good. Werner
participants (5)
- Arthur Reutenauer
- David Kastrup
- Taco Hoekwater
- Vladimir Volovich
- Werner LEMBERG