[NTG-context] Fwd: Unicode normalization and Hebrew in ConTeXt

28 Apr 2020

      Thank you for the prompt and thorough response!

If a separate reordering rule is needed for each pair of characters from
different combining classes that are not in the expected typographical
order, then there will be a lot (probably hundreds) of substitution rules.
I am not very familiar with coding in Lua, but if there is a way to define
substitution features for whole classes of code points, then far fewer
cases would be needed.

Unicode's canonical ordering sorts Hebrew marks by their combining classes:
marks with a higher combining class are placed after those with a lower
combining class. The typographically
recommended ordering of certain characters is found in Table 1 (p. 12) of
https://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf. The following
list of character classes, with information about their Unicode combining
classes (which I retrieved from the Lua script
https://raw.githubusercontent.com/michal-h21/uninormalize/master/char-def-wi...),
is numbered to match the character classes described in that table:
1. The consonants (Unicode points 05D0-05EA) have no combining class and
are never reordered; this is typographically correct.
2. Shin dot and sin dot (05C1-05C2) should be next, but Unicode places them
in combining classes 24 and 25, after the characters in recommended classes
3-5 and many of the characters in recommended class 6.
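3. Dagesh / mapiq (05BC) should be next, but Unicode assigns it a combining
class of 21. After Unicode normalization, it will therefore be incorrectly
ordered before the characters in recommended class 2 and after the
characters in recommended class 5 and many of the characters in recommended
class 6. (It is correctly ordered before rafe in recommended class 4, which
has combining class 23.)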
4. Rafe (05BF) should be next, but Unicode assigns it a combining class of
23. After Unicode normalization, it will therefore be correctly placed
after the character in recommended class 3, but incorrectly placed before
the characters in recommended class 2 and after the characters in
recommended class 5 and many of the characters in recommended class 6.
5. The holam and holam haser vowel points (05B9-05BA) should be next, but
Unicode places them in combining class 19. After Unicode normalization,
they will therefore be placed incorrectly before the characters in
recommended classes 2-4 and after the characters in recommended class 6
whose combining classes are below 19 (05B0-05B8 and 05C7).
6. The characters in 0591, 0596, 059B, 05A2-05A7, 05AA, 05B0-05B8, 05BB,
05BD, 05C5, 05C7 should be treated as being in the same class, but Unicode
places them in combining classes 10-18, 20, 22, and 220.
7. The prepositive marks yetiv and dehi (059A, 05AD) should be next;
Unicode places them in combining class 222, so they should correctly come
after all characters in recommended classes 1-6.
8. The characters 0307, 0593-0595, 0597-0598, 059C-05A1, 05A8, 05AB-05AC,
05AF, 05C4 should be treated as being in the same class; Unicode places
them in combining class 230, so they should correctly come after all
characters in recommended classes 1-7.
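9. The postpositive marks segolta, pashta, telisha qetana, and zinor (0592,
0599, 05A9, 05AE) should be next. Unicode places segolta, pashta, and
telisha qetana in combining class 230, the same class as the characters in
recommended class 8, so normalization leaves their order relative to those
characters unchanged, and they will need to be reordered after them
whenever they are entered first. Zinor (05AE) has combining class 228, so
normalization places it before the characters in recommended class 8, and
it will need to be moved after them.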
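
To make the class-based idea concrete, here is a minimal sketch in standalone Python (not ConTeXt Lua, so that it can be run outside TeX). It assumes the recommended classes are exactly the code points listed above, and the names `RECOMMENDED`, `MARK_CLASS`, `reorder_marks`, and `count_conflicting_pairs` are hypothetical helpers made up for this illustration. It does not show how to hook such a reordering into ConTeXt's font handler; it only shows that one sort keyed on the recommended class can stand in for many pairwise rules.

```python
import unicodedata

# Recommended typographic classes 2-9 from the list above (class 1, the
# consonants, is never reordered). Ranges are inclusive code points.
RECOMMENDED = {
    2: [(0x05C1, 0x05C2)],
    3: [(0x05BC, 0x05BC)],
    4: [(0x05BF, 0x05BF)],
    5: [(0x05B9, 0x05BA)],
    6: [(0x0591, 0x0591), (0x0596, 0x0596), (0x059B, 0x059B),
        (0x05A2, 0x05A7), (0x05AA, 0x05AA), (0x05B0, 0x05B8),
        (0x05BB, 0x05BB), (0x05BD, 0x05BD), (0x05C5, 0x05C5),
        (0x05C7, 0x05C7)],
    7: [(0x059A, 0x059A), (0x05AD, 0x05AD)],
    8: [(0x0307, 0x0307), (0x0593, 0x0595), (0x0597, 0x0598),
        (0x059C, 0x05A1), (0x05A8, 0x05A8), (0x05AB, 0x05AC),
        (0x05AF, 0x05AF), (0x05C4, 0x05C4)],
    9: [(0x0592, 0x0592), (0x0599, 0x0599), (0x05A9, 0x05A9),
        (0x05AE, 0x05AE)],
}

# Flatten the table into a code point -> recommended class lookup.
MARK_CLASS = {
    cp: cls
    for cls, ranges in RECOMMENDED.items()
    for lo, hi in ranges
    for cp in range(lo, hi + 1)
}


def reorder_marks(text):
    """Stable-sort each run of listed marks by recommended class."""
    out, run = [], []
    for ch in text:
        if ord(ch) in MARK_CLASS:
            run.append(ch)
            continue
        # Any other character (e.g. a consonant) ends the current run.
        out.extend(sorted(run, key=lambda c: MARK_CLASS[ord(c)]))
        run = []
        out.append(ch)
    out.extend(sorted(run, key=lambda c: MARK_CLASS[ord(c)]))
    return "".join(out)


def count_conflicting_pairs():
    """Count ordered pairs (a, b) where a is recommended before b,
    but a has the higher Unicode combining class."""
    marks = sorted(MARK_CLASS)
    return sum(
        1
        for a in marks
        for b in marks
        if MARK_CLASS[a] < MARK_CLASS[b]
        and unicodedata.combining(chr(a)) > unicodedata.combining(chr(b))
    )


bet, segol, dagesh, final_nun = "\u05D1", "\u05B6", "\u05BC", "\u05DF"
typographic = bet + dagesh + segol + final_nun
# Canonical ordering moves segol (class 16) before dagesh (class 21).
canonical = unicodedata.normalize("NFD", typographic)

print([f"U+{ord(c):04X}" for c in canonical])
print([f"U+{ord(c):04X}" for c in reorder_marks(canonical)])
print(count_conflicting_pairs())
```

Because the sort is stable, marks that share a recommended class keep the relative order they had in the input. The last line gives a rough estimate of how many pairwise rules a purely pair-based approach would need, under the assumption that one rule per conflicting pair is enough.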

This is a lot of information, and I've probably not presented it as clearly as
I could, so if there is any confusion, please let me know, and I can try to
explain better. If there is any other information you need, please let me
know.

Thanks again!

On Tue, Apr 28, 2020 at 9:17 AM Hans Hagen  wrote:
...
On 4/28/2020 1:59 PM, Joey McCollum wrote:
...
\definefontfeature[f:pointedhebrew][default][
     ccmp=yes,
     mark=yes,
     script=hebr
]
\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]
%Set the body font:
\setupbodyfont[hebrew]
%Set up right-to-left alignment:
\setupalign[r2l]
\starttext
     %Characters after normalization, in Unicode canonical order (bet +
     %segol + dagesh + final nun):
     בֶּן
     %A word with characters in typographically recommended order (bet +
     %dagesh + segol + final nun):
     בֶּן
\stoptext
\startluacode
     fonts.handlers.otf.addfeature {
         name    = "normalizehebrew",
         type    = "chainsubstitution",
         prepend = 1,
         lookups = {
             {
                 type = "multiple",
                 data = {
                     [0x5B6] = { 0x5BC, 0x5B6 },
                 },
             },
         },
         data = {
             rules = {
                 {
                     current = { { 0x5B6 }, { 0x5BC } },
                     lookups = { 1, 0 },
                 },
             },
         },
     }
\stopluacode
\definefontfeature
   [f:pointedhebrew]
   [hebrew]
   [normalizehebrew=yes]
\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]
\setupbodyfont[hebrew]
\setupalign[r2l]
\starttext
     בֶּן \quad בֶּן \par
\stoptext
How many such reorderings are there? (I saw some document about that
font, and it sounds a bit messy with respect to all these input variants.)
(There are several mechanisms in ConTeXt to deal with such issues; it's
all about getting specs from users. TeX is all about control, so in
principle it should be doable.)
Hans
-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------