Unicode normalization and Hebrew in ConTeXt
I am typesetting a document in Hebrew that includes pointing (e.g., vowels, shin and sin dots, dagesh, etc.) using ConTeXt. The Hebrew text that I want to typeset has been normalized into Unicode's NFC canonical form. It is well-known that the Unicode canonical ordering of Hebrew points conflicts with the recommended mark ordering of specific points based on their functions (see https://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf for more on this topic). Thankfully, many typesetting engines automatically reorder the points to ensure that they are combined according to the specifications of many fonts. I'm pretty sure that XeLaTeX is one of these, as it typesets Hebrew letters with multiple points correctly even when the Hebrew text is in NFC form.

My question is, can ConTeXt with LuaTeX handle the same situation correctly? In the following minimal example, ConTeXt typesets pointed Hebrew correctly when the characters are in the typographically recommended order, but not when they are in Unicode canonical order:

```
%Setup Hebrew text font:
\definefontfeature[f:pointedhebrew][default][ ccmp=yes, mark=yes, script=hebr ]
\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]
%Set the body font:
\setupbodyfont[hebrew]
%Set up right-to-left alignment:
\setupalign[r2l]
\starttext
%Characters after normalization, in Unicode canonical order (bet + segol + dagesh + final nun):
בֶּן
%A word with characters in typographically recommended order (bet + dagesh + segol + final nun):
בֶּן
\stoptext
```

I typeset this using ConTeXt version 2020.03.10, as released with TeXLive 2020. I got the SBL Hebrew font from https://www.sbl-site.org/educational/BiblicalFonts_SBLHebrew.aspx. According to the font's user manual (see the link above the MWE), the font should be able to combine the marks to form the correct glyph regardless of their order after the consonant, but that doesn't seem to be the case here.
I also tried using the predefined "hebrew" featureset, but that did not change anything. Is there some other OpenType feature or featureset I need to enable to fix this, or is there some module or option I can include to get ConTeXt to typeset Unicode-normalized Hebrew as if it were ordered in the recommended way, like XeLaTeX does? I see that the uninormalize module is mentioned in the thread "XeLaTeX, LuaLaTeX, fontspec, unicode and normalization" on TeX Stack Exchange ( https://tex.stackexchange.com/questions/229044/xelatex-lualatex-fontspec-uni...); can that be used with ConTeXt? Thank you, Joey
On 4/28/2020 1:59 PM, Joey McCollum wrote:
...
\startluacode
    -- 0x5B6 = segol, 0x5BC = dagesh
    fonts.handlers.otf.addfeature {
        name    = "normalizehebrew",
        type    = "chainsubstitution",
        prepend = 1,
        lookups = {
            {
                type = "multiple",
                data = {
                    [0x5B6] = { 0x5BC, 0x5B6 },
                },
            },
        },
        data = {
            rules = {
                {
                    -- match a segol immediately followed by a dagesh
                    current = { { 0x5B6 }, { 0x5BC } },
                    lookups = { 1, 0 },
                },
            },
        },
    }
\stopluacode

\definefontfeature
  [f:pointedhebrew]
  [hebrew]
  [normalizehebrew=yes]

\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]

\setupbodyfont[hebrew]

\setupalign[r2l]

\starttext
    בֶּן \quad בֶּן \par
\stoptext

How many such reorderings are there? (I saw some document about that font and it sounds a bit messy with respect to all these input variants.)

(There are several mechanisms in ConTeXt to deal with such issues. It's all about getting specs from users, i.e. TeX is all about control, so in principle it should be doable.)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
On 4/28/2020 1:59 PM, Joey McCollum wrote:
...
My question is, can ConTeXt with LuaTeX handle the same situation correctly? In the following minimal example, ConTeXt typesets pointed Hebrew correctly when the characters are in the typographically recommended order, but not when they are in Unicode canonical order:

We (Joey and I) figured out how best to deal with this. As a result, the predefined hebrew feature will now do the right thing for fonts that assume some specific ordering. So, this should work okay in the most recent upload:

\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=hebrew]

Maybe there should be a wiki page that summarizes tests with Hebrew fonts (but I leave that up to Joey).

Hans
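
For anyone who wants to reproduce the test, the following is a minimal sketch of a complete test file. It assumes that SBL Hebrew is installed and that the ConTeXt release already contains the updated hebrew feature set. The code points are written out with \char (an editorial choice, not something used earlier in the thread) so that the mark order cannot be silently changed by an editor or mail client.

```
% Sketch only: assumes SBL Hebrew is installed and a ConTeXt upload
% from 30 April 2020 or later.
\definefontfamily[hebrew] [rm] [SBL Hebrew] [features=hebrew]
\setupbodyfont[hebrew]
\setupalign[r2l]

\starttext

% Unicode canonical (NFC) order: bet, segol, dagesh, final nun
\char"05D1\char"05B6\char"05BC\char"05DF
\quad
% Typographically recommended order: bet, dagesh, segol, final nun
\char"05D1\char"05BC\char"05B6\char"05DF

\stoptext
```

If the reordering is active, both words should render identically, with the dagesh inside the bet and the segol centered below it.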
Thanks so much, Hans! I should be able to add a wiki page summarizing the
tests before the end of the week.
For reference purposes, do you know which version of ConTeXt has (or will
have) this update included?
Joey
On 4/30/2020 4:28 PM, Joey McCollum wrote:
Thanks so much, Hans! I should be able to add a wiki page summarizing the tests before the end of the week.
For reference purposes, do you know which version of ConTeXt has (or will have) this update included?

Today's upload.
Okay! I have not figured out how to add a new page to the wiki, but I was
able to add a section to the end of the "Arabic and Hebrew" page (
https://www.contextgarden.net/Arabic_and_Hebrew) discussing the issue,
providing a test, and briefly describing the fix.
Joey
Hans,
Sorry to bring this up after over a year, but I just noticed something that
doesn't seem right. I implemented some contextual substitutions in my own
fork of the Keter YG Hebrew font (.ttf file attached) under the "dlig"
feature that should do the following two things:
1. If a *shin* with a *sin* dot (שׂ) is pointed with a *holam* (the
vowel point placed high and on the left), then the *shin*, *sin* dot,
and *holam* are combined into a single ligature that depicts the *sin* dot
and *holam* merged into a single point.
2. If a *shin* with a *shin* dot (שׁ) follows another letter pointed
with a *holam* (except for *vav*, which must be pointed with a *holam
haser*), then the *shin* and *shin* dot are replaced with a ligature that
moves the *shin* dot a bit to the right (so that it appears to be merged
with the preceding *holam*), and the combination of the preceding letter
and the actual *holam* is changed to just the preceding letter (thus
effectively stripping the old *holam*).
I've tested both of these features in FontForge, and they work as expected
there. Likewise, if I test them in the following XeLaTeX script, XeLaTeX
handles both rules correctly:
```
\documentclass{article}
%Set fonts and font features:
\usepackage{fontspec}
\setmainfont[Path=../fonts/KeterYG/, UprightFont = *-Medium, Script=Hebrew,
  Ligatures=Discretionary]{KeterYG} % I'm using a local copy of the attached font
\begin{document}
שֹׂבַע
עָשׂוֹר
קֹשֶׁט
שֹׁשַׁנִּים
עָשׂוֹר
מֹשֶׁה
שַׁלֹשׁ
\end{document}
```
But in ConTeXt, only rule (1) above works as expected. Here is a minimal
(non-)working example:
```
\starttypescriptcollection[keteryg]
\starttypescript[serif][keteryg]
\definefontsynonym[Serif][file:../fonts/KeterYG/KeterYG-Medium.ttf][features=hebrew]
% use a local copy of the attached font, with all the necessary Hebrew
% features (this includes dlig by default)
\stoptypescript
\starttypescript[keteryg]
\definetypeface[keteryg][rm][serif][keteryg][default]
\stoptypescript
\stoptypescriptcollection
%Set up the main font:
\setupbodyfont[keteryg]
%Set up right-to-left alignment:
\setupalign[r2l]
\starttext
שֹׂבַע
עָשׂוֹר
קֹשֶׁט
שֹׁשַׁנִּים
עָשׂוֹר
מֹשֶׁה
שַׁלֹשׁ
\stoptext
```
In examples 3, 4, 6, and 7, the *holam* dot still appears before the
*shin*-with-merged-*shin*-dot-and-*holam* ligature, when it should be
absent. (I realize that it may be difficult to tell; in the last two
examples, the presence of two dots is easier to make out.)
Do you have any idea why this might be happening in ConTeXt? Does the glyph
reordering in font-imp-combining.lua take place before any OpenType
features in the font are applied?
Thanks again!
Joey
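
One way to investigate a question like this is to let ConTeXt trace the OpenType processing steps for a single word. The sketch below uses \showotfcomposition, which takes a font specification, a direction (-1 for right-to-left), and the text. The font path and the sample word are taken from the example above; the size of 24pt is an arbitrary choice, and what the trace reports for this particular font is not something this thread shows.

```
% Sketch only: traces the feature steps applied to one word.
% The path points at the local font copy used in the example above.
\starttext

\showotfcomposition
  {file:../fonts/KeterYG/KeterYG-Medium.ttf*hebrew at 24pt}
  {-1}
  {מֹשֶׁה}

\stoptext
```

The resulting page lists the intermediate glyph runs, which may help show whether a contextual rule matched the input sequence it was written for.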
On 8/17/2021 2:07 AM, Joey McCollum wrote:
Sorry to bring this up after over a year, but I just noticed something that doesn't seem right. I implemented some contextual substitutions in my own fork of the Keter YG Hebrew font (.ttf file attached) under the "dlig" feature that should do the following two things:

But you don't enable dlig.
Shouldn't dlig automatically be enabled under the "hebrew" feature set? In
font-pre.mkiv, hebrew inherits from semitic-complete, which sets dlig=yes.
Still, if I explicitly add dlig, as in the following example, things
change, but they still aren't right:
```
\starttypescriptcollection[keteryg]
\starttypescript[serif][keteryg]
\definefontsynonym[Serif][file:../fonts/KeterYG/KeterYG-Medium.ttf][features=hebrew]
% all the necessary Hebrew features, including dlig
\stoptypescript
\starttypescript[keteryg]
\definetypeface[keteryg][rm][serif][keteryg][default]
\stoptypescript
\stoptypescriptcollection
%Set up the main font:
\setupbodyfont[keteryg]
%Set up right-to-left alignment:
\setupalign[r2l]
%Explicitly add dlig (in case it wasn't there already):
\definefontfeature[plus-dlig][dlig=yes]
\starttext
\addff{plus-dlig}
שֹׂבַע
עָשׂוֹר
קֹשֶׁט
שֹׁשַׁנִּים
עָשׂוֹר
מֹשֶׁה
שַׁלֹשׁ
\stoptext
```
In examples 1, 3, 4, and 6, the *holam* of the preceding
letter (which should have been stripped in the contextual substitution)
just seems to have been moved farther up. In fact, the output looks like it
would look if I turned off the reordercombining feature. (And indeed, if I
manually reorder the glyphs to the Hebrew Layout Intelligence order, then
the results look like they did when I just used the "hebrew" feature.)
I may have forgotten to attach the font file I was using for this test. If
that is the case, it is available at https://github.com/jjmccollum/Keter-YG.
Joey
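
The comparison described above can be set up side by side with two derived feature sets. This is only a sketch under stated assumptions: the names hebrew-dlig, hebrew-noreorder, KeterWithReorder and KeterNoReorder are hypothetical, and it assumes that reordercombining=no switches off the reordering that the hebrew feature set enables. Only hebrew, dlig and reordercombining themselves come from this thread.

```
% Sketch only: the feature-set and font names below are hypothetical.
\definefontfeature[hebrew-dlig]     [hebrew][dlig=yes]
\definefontfeature[hebrew-noreorder][hebrew][dlig=yes,reordercombining=no]

\definefont[KeterWithReorder][file:../fonts/KeterYG/KeterYG-Medium.ttf*hebrew-dlig at 24pt]
\definefont[KeterNoReorder]  [file:../fonts/KeterYG/KeterYG-Medium.ttf*hebrew-noreorder at 24pt]

\setupalign[r2l]

\starttext

% Same word, with and without mark reordering
{\KeterWithReorder מֹשֶׁה}\par
{\KeterNoReorder מֹשֶׁה}\par

\stoptext
```

Defining dlig inside a derived feature set, rather than adding it later with \addff, keeps every feature visible in one place, which makes it easier to see what is actually active when the two lines differ.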
Thankfully, it looks like this was just a problem with my implementation of
the OpenType feature and not with ConTeXt's handling of it! (I worried that
it might be ConTeXt when I saw that XeLaTeX was handling the feature
correctly.) Hans graciously helped me identify the problem, and everything
looks good now!
Joey
On 8/17/2021 9:46 PM, Joey McCollum wrote:
Thankfully, it looks like this was just a problem with my implementation of the OpenType feature and not with ConTeXt's handling of it! (I worried that it might be ConTeXt when I saw that XeLaTeX was handling the feature correctly.) Hans graciously helped me identify the problem, and everything looks good now!

Just for the record: one can best try to make a font as robust as possible and not rely on side effects (ambiguous cases). When Idris and I tested some shapers we found that there can be inconsistent results (fwiw, in a rather complex font ConTeXt agreed more often with Uniscribe than XeTeX, but in the end one has to make the font okay for all, I guess).

When we started with OpenType (LuaTeX showed up in 2005) we took Uniscribe as reference, so that is our benchmark. And lack of specs made us figure out things stepwise.

Now, if something works in one shaper and not in another, it can of course be due to bugs, but it can also be that the spec is simply fuzzy and choices have been made. There is then the danger that eventually bugs become features (I assume the amount of leverage matters here, and TeX has zero), which then settles it (kind of), but that doesn't mean that one should gamble on it.

The same is true for font names: don't rely too much on the heuristics hard-coded in programs (e.g. FontForge has some for font names, properties, and glyph names, and although that is nice for recovery, it also makes other usage hard, because fighting fuzzy heuristics is hard once information is lost).

Btw, a side effect of your 'issue' is that I found a way to save some memory for some fonts (for now only in LMTX) at the cost of hopefully little extra runtime.

Hans
participants (2): Hans Hagen, Joey McCollum