Hello all, I just came across this library: http://sourceforge.net/projects/hunspell/files/Hyphen/ which seems to be the latest iteration of libhnj and is currently used by OpenOffice/LibreOffice. It seems to have some interesting features over the original hyphenation algorithm, namely support for "non-standard hyphenation; `discretionary' character changes at hyphenation points" and "compound word hyphenation and special rules of compound word hyphenation of German languages and other languages with arbitrary number of compound words." I recall reading about something related to German hyphenation planned for luatex, so I thought this might be of some interest here. Regards, Khaled -- Khaled Hosny Egyptian Arab
On 09/15/2011 10:56 PM, Khaled Hosny wrote:
Hello all,
I just came across this library: http://sourceforge.net/projects/hunspell/files/Hyphen/
which seems to be the latest iteration of libhnj and is currently used by OpenOffice/LibreOffice. It seems to have some interesting features over the original hyphenation algorithm, namely support for "non-standard hyphenation; `discretionary' character changes at hyphenation points" and "compound word hyphenation and special rules of compound word hyphenation of German languages and other languages with arbitrary number of compound words."
Hyphenation in luatex is in fact an adaptation of a (slightly earlier) version of libhnj. At that time, it did not do compound word stuff yet, so I have to check that out. It did then already have non-standard hyphenation. However, that was implemented as such a hack that I decided to leave it out in the new luatex code, and opted for non-standard hyphenation in the exceptions instead of in the patterns proper. (What libhnj did at that time was disguise dictionary exceptions as patterns, so the non-standard hyphenation 'pattern rules' were in fact complete words with a single non-standard hyphenation in them somewhere.) Considering the quality of the non-standard hyphenation support, I do not have high expectations for the compound word extension, to be honest. Best wishes, Taco
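A minimal sketch of the exception-based approach described above, in which a non-standard hyphenation is attached to a complete word rather than hidden in a pattern. This is not luatex or libhnj code: the `Discretionary` record, the `EXCEPTIONS` table and `line_break_variants` are hypothetical names invented for the illustration, and the two entries are the familiar pre-1996 German cases.

```python
from typing import NamedTuple

class Discretionary(NamedTuple):
    start: int  # index where the replaced span begins
    end: int    # index just past the replaced span
    pre: str    # text ending the first line if the break is taken
    post: str   # text starting the second line if the break is taken

# Hypothetical exception list: whole words with one non-standard break each.
EXCEPTIONS = {
    "zucker": [Discretionary(2, 4, "k-", "k")],        # Zuk-ker
    "schiffahrt": [Discretionary(4, 6, "ff-", "f")],   # Schiff-fahrt
}

def line_break_variants(word):
    """Return (first_line, second_line) pairs for a listed word, else None."""
    entry = EXCEPTIONS.get(word.lower())
    if entry is None:
        return None  # the caller would fall back to ordinary patterns
    return [(word[:d.start] + d.pre, d.post + word[d.end:]) for d in entry]

print(line_break_variants("Zucker"))
print(line_break_variants("Schiffahrt"))
```

Keeping such entries in a separate table means the ordinary patterns never have to carry replacement text, which is the separation argued for above.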
On Fri, Sep 16, 2011 at 08:07:58AM +0200, Taco Hoekwater wrote:
On 09/15/2011 10:56 PM, Khaled Hosny wrote:
Hello all,
I just came across this library: http://sourceforge.net/projects/hunspell/files/Hyphen/
which seems to be the latest iteration of libhnj and is currently used by OpenOffice/LibreOffice. It seems to have some interesting features over the original hyphenation algorithm, namely support for "non-standard hyphenation; `discretionary' character changes at hyphenation points" and "compound word hyphenation and special rules of compound word hyphenation of German languages and other languages with arbitrary number of compound words."
Hyphenation in luatex is in fact an adaptation of a (slightly earlier) version of libhnj. At that time, it did not do compound word stuff yet, so I have to check that out. It did then already have non-standard hyphenation.
I know that luatex is using libhnj; that is why I thought this new development might be of interest.
However, that was implemented as such a hack that I decided to leave it out in the new luatex code, and opted for non-standard hyphenation in the exceptions instead of in the patterns proper. (What libhnj did at that time was disguise dictionary exceptions as patterns, so the non-standard hyphenation 'pattern rules' were in fact complete words with a single non-standard hyphenation in them somewhere.)
They have this TUGboat paper on non-standard hyphenation. I'm not sure if it is the old libhnj thing; it is all gibberish to me :) http://sourceforge.net/projects/hunspell/files/Hyphen/documentation/tb87neme... Regards, Khaled -- Khaled Hosny Egyptian Arab
Taco Hoekwater wrote:
On 09/15/2011 10:56 PM, Khaled Hosny wrote:
I just came across this library: http://sourceforge.net/projects/hunspell/files/Hyphen/
Hyphenation in luatex is in fact an adaptation of a (slightly earlier) version of libhnj. At that time, it did not do compound word stuff yet, so I have to check that out. It did then already have non-standard hyphenation.
However, that was implemented as such a hack that I decided to leave it out in the new luatex code, and opted for non-standard hyphenation in the exceptions instead of in the patterns proper. (What libhnj did at that time was disguise dictionary exceptions as patterns, so the non-standard hyphenation 'pattern rules' were in fact complete words with a single non-standard hyphenation in them somewhere.)
As Taco already pointed out, libhnj mixes up regular patterns and non-standard hyphenation patterns. I sent a proposal about compound word hyphenation to Taco a while ago that clearly separates patterns with different semantics. In this context, different semantics means different hyphenation penalties. That is, provide a separate set of hyphenation patterns for each needed hyphenation penalty, i.e., patterns
* for compound word hyphenation,
* for prefix and suffix hyphenation,
* for suppressing aesthetically unpleasant hyphenations,
* etc.
For the German language, I think even more than five different penalty classes could be desirable. All these sets of patterns can be applied to a word in parallel, and the penalties are chosen according to which pattern set matches a spot. The same pattern approach can be used to handle non-standard hyphenation, ligaturing, round-s recognition, etc. (I think there are use-cases in Arabic script as well.) I don't know what Taco's current plan is, though. The corresponding tracker item reads "multi-pass hyphenation", URL:http://tracker.luatex.org/view.php?id=168, whereas my proposal is about applying patterns in parallel rather than in multiple passes.
Best regards, Stephan Hennig
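The parallel-pattern idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposal's actual design: the pattern sets and penalty values are toy data invented for one word, the matcher is a plain Liang-style maximum-of-levels search, and taking the lowest penalty when several sets match the same spot is an arbitrary choice made here.

```python
import re

def parse_pattern(pat):
    """Split a Liang-style pattern such as 'r1g' into ('rg', [0, 1, 0])."""
    letters = re.sub(r"\d", "", pat)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels

def break_points(word, patterns, left_min=2, right_min=2):
    """Indices i where a break between word[i-1] and word[i] is allowed."""
    padded = "." + word.lower() + "."
    table = dict(parse_pattern(p) for p in patterns)
    scores = [0] * (len(padded) + 1)  # scores[j]: level before padded[j]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels:
                for k, lv in enumerate(levels):
                    scores[start + k] = max(scores[start + k], lv)
    # word[i] is padded[i + 1]; odd levels permit a break.
    return {i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1}

# Toy pattern sets (assumed for this example): (penalty, patterns).
PATTERN_SETS = [
    (10, ["r1g"]),         # compound boundary: winter-garten
    (50, ["n1t", "r1t"]),  # ordinary syllable breaks: win-ter, gar-ten
]

def penalized_breaks(word, pattern_sets):
    """Apply every set independently and keep the lowest penalty per spot."""
    result = {}
    for penalty, patterns in pattern_sets:
        for i in break_points(word, patterns):
            result[i] = min(penalty, result.get(i, penalty))
    return result

print(penalized_breaks("wintergarten", PATTERN_SETS))
```

Each set is matched on its own, so no set needs to know about the others; only the final merge step decides which penalty a position receives.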
On Wed, Sep 21, 2011 at 01:00:08AM +0200, Stephan Hennig wrote:
Taco Hoekwater wrote:
On 09/15/2011 10:56 PM, Khaled Hosny wrote:
I just came across this library: http://sourceforge.net/projects/hunspell/files/Hyphen/
Hyphenation in luatex is in fact an adaptation of a (slightly earlier) version of libhnj. At that time, it did not do compound word stuff yet, so I have to check that out. It did then already have non-standard hyphenation.
However, that was implemented as such a hack that I decided to leave it out in the new luatex code, and opted for non-standard hyphenation in the exceptions instead of in the patterns proper. (What libhnj did at that time was disguise dictionary exceptions as patterns, so the non-standard hyphenation 'pattern rules' were in fact complete words with a single non-standard hyphenation in them somewhere.)
As Taco already pointed out, libhnj mixes up regular patterns and non-standard hyphenation patterns. I sent a proposal about compound word hyphenation to Taco a while ago that clearly separates patterns with different semantics.
In this context, different semantics means different hyphenation penalties. That is, provide a separate set of hyphenation patterns for each needed hyphenation penalty, i.e., patterns
* for compound word hyphenation,
* for prefix and suffix hyphenation,
* for suppressing aesthetically unpleasant hyphenations,
* etc.
For the German language, I think even more than five different penalty classes could be desirable. All these sets of patterns can be applied to a word in parallel and the penalties are chosen according to which pattern set matches a spot.
The same pattern approach can be used to handle non-standard hyphenation, ligaturing, round-s recognition, etc. (I think there are use-cases in Arabic script as well.)
I don't know what Taco's current plan is, though. The corresponding tracker item reads "multi-pass hyphenation", URL:http://tracker.luatex.org/view.php?id=168, whereas my proposal is about applying patterns in parallel rather than in multiple passes.
I've to admit this is all Greek to me :) (the only language written in Arabic script that accepts hyphenation is Uyghur, and that is for the new, Chinese-imposed orthography). I just thought re-using existing code and patterns might be of interest, but I can't judge the quality of either of them. Regards, Khaled -- Khaled Hosny Egyptian Arab
Khaled Hosny wrote:
On Wed, Sep 21, 2011 at 01:00:08AM +0200, Stephan Hennig wrote:
The same pattern approach can be used to handle non-standard hyphenation, ligaturing, round-s recognition, etc. (I think there are use-cases in Arabic script as well.)
I've to admit this is all Greek to me :) (the only language written in Arabic script that accepts hyphenation is Uyghur, and that is for the new, Chinese-imposed orthography).
I didn't mean hyphenation in Arabic script. In Latin black letter script there are two different glyphs for the small letter s -- the long ſ and the usual round s. Most keyboards don't provide a key for the ſ, so source documents usually make use of the round s only. To put an ſ at the respective places in the typeset document, the traditional way is to mark up those places with s+ or s: (different black letter fonts and support packages use different conventions). Now, automatically applying glyph substitution at the correct places within a character stream /without/ mark-up is pretty much the same problem as finding hyphenation positions within a character stream without mark-up. The same holds for applying non-standard hyphenation without mark-up, or for applying ligatures only at the correct places without mark-up. Currently, TeX inserts ligatures based on a greedy rule, which produces many false positives in languages with compound words. For Arabic script, I referred to glyph substitution. But I don't know enough about Arabic script to explain further. :) Best regards, Stephan Hennig
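To make the false-positive problem above concrete, here is a small sketch contrasting greedy ligaturing with a variant that refuses ligatures spanning a compound boundary. It is an assumption-laden toy, not TeX's ligature mechanism: the `ligate` function, the ligature list and the glyph names such as `f_l` are invented here, and the boundary positions are simply passed in by hand, where in the scheme discussed above they would come from a pattern set.

```python
# Candidate ligatures, longest first so that 'ffi' wins over 'ff' and 'fi'.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")

def ligate(word, boundaries=frozenset()):
    """Return a glyph list for word.

    boundaries holds indices b meaning 'compound boundary between
    word[b-1] and word[b]'. With no boundaries this is plain greedy
    ligaturing.
    """
    glyphs = []
    i = 0
    while i < len(word):
        for lig in LIGATURES:
            inside = range(i + 1, i + len(lig))  # gaps the ligature would bridge
            if word.startswith(lig, i) and not any(b in boundaries for b in inside):
                glyphs.append("_".join(lig))     # e.g. 'f_l'
                i += len(lig)
                break
        else:
            glyphs.append(word[i])
            i += 1
    return glyphs

# 'Auflage' is Auf + lage, so the boundary sits before index 3.
print(ligate("Auflage"))        # greedy: joins f and l across the boundary
print(ligate("Auflage", {3}))   # boundary-aware: f and l stay separate
```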
On Wed, Sep 21, 2011 at 06:22:18PM +0200, Stephan Hennig wrote:
For Arabic script, I referred to glyph substitution. But I don't know enough about Arabic script to explain further. :)
Arabic handling in luatex (or rather luatex based packages) is done by OpenType layout features processed by lua code, no engine techniques are involved. Regards, Khaled -- Khaled Hosny Egyptian Arab
Khaled Hosny wrote:
On Wed, Sep 21, 2011 at 06:22:18PM +0200, Stephan Hennig wrote:
For Arabic script, I referred to glyph substitution. But I don't know enough about Arabic script to explain further. :)
Arabic handling in luatex (or rather luatex based packages) is done by OpenType layout features processed by lua code, no engine techniques are involved.
Didn't I warn you, I don't know much about Arabic script? :) What OpenType features are there that are used in Arabic script? Is OpenType powerful enough to solve all script related typesetting problems in Arabic typography? Lucky you! Best regards, Stephan Hennig
Stephan Hennig wrote:
Khaled Hosny wrote:
Arabic handling in luatex (or rather luatex based packages) is done by OpenType layout features processed by lua code, no engine techniques are involved.
Is OpenType powerful enough to solve all script related typesetting problems in Arabic typography?
OK, I've read a (very) little bit about typesetting Arabic script and found in URL:http://en.wikipedia.org/wiki/Complex_script
Context-sensitive shaping (ligatures), where a character may change its shape, dependent on its location and/or the surrounding characters. For example, a character in Arabic script can have as many as four different shape-forms, depending on context.
As far as I know, OpenType uses (more or less) simple rules rather than patterns for contextual features (I may be wrong about that). Does that mean glyph substitution in Arabic script can be managed by rules only? Best regards, Stephan Hennig
On Thu, Sep 22, 2011 at 11:57:16PM +0200, Stephan Hennig wrote:
Stephan Hennig wrote:
Khaled Hosny wrote:
Arabic handling in luatex (or rather luatex based packages) is done by OpenType layout features processed by lua code, no engine techniques are involved.
Is OpenType powerful enough to solve all script related typesetting problems in Arabic typography?
OK, I've read a (very) little bit about typesetting Arabic script and found in URL:http://en.wikipedia.org/wiki/Complex_script
Context-sensitive shaping (ligatures), where a character may change its shape, dependent on its location and/or the surrounding characters. For example, a character in Arabic script can have as many as four different shape-forms, depending on context.
As far as I know, OpenType uses (more or less) simple rules rather than patterns for contextual features (I may be wrong about that). Does that mean glyph substitution in Arabic script can be managed by rules only?
OpenType is a mix of both; there are basic features where the font provides simple substitution lists and the OpenType engine (in our case entirely written in lua) has to do contextual analysis using predefined algorithms to decide where and when to apply those features, e.g. initial, medial, final and isolated forms in Arabic. But there are also contextual features with more complex rules, in which the font embeds all the knowledge about the context in which features are applied, and these are used for less straightforward cases where the context is font-dependent rather than defined by script orthography. Regards, Khaled -- Khaled Hosny Egyptian Arab
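As a rough illustration of the engine-side contextual analysis mentioned above, the sketch below picks one of the OpenType feature tags `isol`, `init`, `medi` or `fina` for each letter from its joining behaviour. It is a simplification and not the lua code used in luatex-based packages: the `JOINING` table covers only four letters (with joining types as listed in Unicode's ArabicShaping data), `positional_forms` is a name made up here, and diacritics, ZWJ/ZWNJ and everything else a real shaper handles are ignored.

```python
# Joining types for a handful of letters:
# D = dual-joining, R = right-joining (joins only to the preceding letter).
JOINING = {
    "\u0643": "D",  # kaf
    "\u062A": "D",  # teh
    "\u0627": "R",  # alef
    "\u0628": "D",  # beh
}

def positional_forms(word):
    """Return the positional feature tag for each character, in logical order."""
    def jt(i):
        # Anything outside the word or outside the table counts as non-joining.
        return JOINING.get(word[i], "U") if 0 <= i < len(word) else "U"

    forms = []
    for i in range(len(word)):
        joins_prev = jt(i - 1) == "D" and jt(i) in ("D", "R")
        joins_next = jt(i) == "D" and jt(i + 1) in ("D", "R")
        if joins_prev and joins_next:
            forms.append("medi")
        elif joins_prev:
            forms.append("fina")
        elif joins_next:
            forms.append("init")
        else:
            forms.append("isol")
    return forms

# kaf, teh, alef, beh ("kitab"): alef does not join forward, so beh stands alone.
print(positional_forms("\u0643\u062A\u0627\u0628"))
```

The font then only needs to supply a simple substitution list per tag; the decision about which tag applies where is made by code like this rather than by the font.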
Khaled Hosny wrote:
But there are also contextual features with more complex rules, in which the font embeds all the knowledge about the context in which features are applied, and these are used for less straightforward cases where the context is font-dependent rather than defined by script orthography.
Thank you for the explanation! Any buzz words regarding these techniques I can throw into a search engine? Best regards, Stephan Hennig
On Sun, Sep 25, 2011 at 03:47:38PM +0200, Stephan Hennig wrote:
Khaled Hosny wrote:
But there are also contextual features with more complex rules, in which the font embeds all the knowledge about the context in which features are applied, and these are used for less straightforward cases where the context is font-dependent rather than defined by script orthography.
Thank you for the explanation! Any buzz words regarding these techniques I can throw into a search engine?
OpenType contextual substitution and reverse chaining contextual substitution. Regards, Khaled -- Khaled Hosny Egyptian Arab
On Thu, Sep 22, 2011 at 07:29:38PM +0200, Stephan Hennig wrote:
Khaled Hosny wrote:
On Wed, Sep 21, 2011 at 06:22:18PM +0200, Stephan Hennig wrote:
For Arabic script, I referred to glyph substitution. But I don't know enough about Arabic script to explain further. :)
Arabic handling in luatex (or rather luatex based packages) is done by OpenType layout features processed by lua code, no engine techniques are involved.
Didn't I warn you, I don't know much about Arabic script? :) What OpenType features are there that are used in Arabic script? Is OpenType powerful enough to solve all script related typesetting problems in Arabic typography? Lucky you!
More or less, and there are fonts addressing fairly complex problems implemented in OpenType. Of course it does not solve everything, but it has fairly generic mechanisms to address the vast majority of Arabic typographic issues. Regards, Khaled -- Khaled Hosny Egyptian Arab
participants (3): Khaled Hosny, Stephan Hennig, Taco Hoekwater