Hello,

if I understand the svn version of the manual correctly, after hyphenation, character nodes are converted to glyph nodes (BTW: The new section 5 is very good, as it is very detailed). During this step, first ligatures are created, then kerning is applied.

Now, LuaTeX uses Unicode internally, but the current font need not be Unicode. Given the letter "ä" and a font in OT1 encoding, how would I convert this "ä" to an accented glyph *after* hyphenation has happened (to be able to hyphenate words containing "ä")?

I looked into the manual, but found only two callbacks for inserting ligatures and kerning, not for converting characters to glyphs (incidentally, the documentation for these callbacks is very meager). Is this done using one of the font tables? If so, how do I set it up correctly?

Thanks in advance,
Jonathan
Jonathan Sauer wrote:
Hello,
if I understand the svn version of the manual correctly, after hyphenation, character nodes are converted to glyph nodes (BTW: The new section 5 is very good, as it is very detailed). During this step, first ligatures are created, then kerning is applied.
Now, LuaTeX uses Unicode internally, but the current font need not be Unicode.
However, by default it is assumed to be. This assumption has to be made because there is no way to get the required mappings from a tfm file if this were not the case. So for remapping to happen, you have to write and register a bit of Lua code (there was a choice between this assumption and inventing a new dedicated font-related file format).
Given the letter "ä" and a font in OT1 encoding, how would I convert this "ä" to an accented glyph *after* hyphenation has happened (to be able to hyphenate words containing "ä")?
The logical place is in the ligaturing callback, because for tfm-based fonts, the actual ligature insertions can be done in one line via the node.ligaturing() built-in function.

  \directlua0{
    callback.register('ligaturing',
      function (head, tail)
        convert_to_glyphs(head)
        node.ligaturing(head)
        return
      end)
  }

Note: you don't have to worry about return values because the head node that is passed on to the callback is guaranteed not to be a glyph_node (if need be, a temporary node will be inserted), and therefore it cannot be affected by the mutations that take place. In this case, the 'tail' node is not used (the link of 'tail' is guaranteed to be 'nil').

You have to write the convert_to_glyphs function of course, and it could look like this:

  \directlua0{
    function find_font_glyph (f,c)
      % you should do some real work here!
      return c
    end

    function convert_to_glyphs (head)
      for v in node.traverse_id(node.id('glyph'), head) do
        if v.subtype == 1 then
          v.character = find_font_glyph(v.font, v.character)
          v.subtype = 0
        end
      end
    end
  }
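For the OT1 case, the "real work" in find_font_glyph amounts to a table lookup. A minimal sketch, assuming a plain OT1 text font; the ot1_remap table is hypothetical and can only list code points that exist as single OT1 slots (accented letters like "ä" have no slot of their own and need the accent techniques discussed later in this thread):

  \directlua0{
    local ot1_remap = {
      [0x00DF] = 25,  % germandbls
      [0x00E6] = 26,  % ae
      [0x00C6] = 29,  % AE
    }
    function find_font_glyph (f, c)
      % most codes below 128 coincide with ASCII in OT1,
      % so fall through when there is no remap entry
      return ot1_remap[c] or c
    end
  }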
I looked into the manual, but found only two callbacks for inserting ligatures and kerning, not for converting characters to glyphs (incidentally, the documentation for these callbacks is very meager).
The manual was not finished yet; I have just added a few paragraphs on this subject.

Best wishes, Taco
Hello, thanks for your quick and detailed reply!
[...]
Given the letter "ä" and a font in OT1 encoding, how would I convert this "ä" to an accented glyph *after* hyphenation has happened (to be able to hyphenate words containing "ä")?
The logical place is in the ligaturing callback, because for tfm-based fonts, the actual ligature insertions can be done in one line via the node.ligaturing() built-in function.
[...]
Note: you don't have to worry about return values because the head node that is passed on to the callback is guaranteed not to be a glyph_node (if need be, a temporary node will be inserted), and therefore it cannot be affected by the mutations that take place. In this case, the 'tail' node is not used (the link of 'tail' is guaranteed to be 'nil').
You mean if I insert additional nodes in between? If I only modify the nodes themselves without touching the list, this should not be a problem anyway.
You have to write the convert_to_glyphs function of course, and it could look like this:
\directlua0{
  function find_font_glyph (f,c)
    % you should do some real work here!
    return c
  end

  function convert_to_glyphs (head)
    for v in node.traverse_id(node.id('glyph'), head) do
      if v.subtype == 1 then
        v.character = find_font_glyph(v.font, v.character)
        v.subtype = 0
      end
    end
  end
}
Shouldn't the subtype be "2"? In section 7.1.2.12, bit 1 is used to denote a glyph, if I understand the manual correctly. Since I convert the character to a glyph in the font, this bit should be set afterwards. Also, in the manual the field is called "char", not "character". Which one is correct? (both?) Anyway, now I have a place where to start. Still, I am a bit clueless on how to create an accented character node. Or do I have to insert another character node in the list containing the accent (i.e. '"' to create an 'ä' from an 'a')? How do I tell this new node to overlap the 'a'?
Thanks in advance, Jonathan
Jonathan Sauer wrote:
Anyway, now I have a place where to start. Still, I am a bit clueless on how to create an accented character node. Or do I have to insert another character node in the list containing the accent (i.e. '"' to create an 'ä' from an 'a')? How do I tell this new node to overlap the 'a'?
Depends ... one of:

(1) use \accent
(2) insert a glyph node (") after the current glyph node (a) and put a kern node in between, or use the x/y offsets in the glyph nodes
(3) make a virtual character in the font and change the glyph's (a) reference to the one of the virtual char

Hans
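A rough, untested sketch of option (2) in plain Lua, assuming 'head' is the node list, 'v' is the glyph node carrying the 'a', the dieresis sits in OT1 slot 127 (as with plain TeX's \accent127), and font.getfont() is used to look up glyph widths; the kern arithmetic mimics what \accent does:

  local chars = font.getfont(v.font).characters
  local wd_a, wd_acc = chars[v.char].width, chars[127].width

  local k1 = node.new("kern")      -- back up to the accent position
  k1.kern = -(wd_a + wd_acc) / 2
  local acc = node.new("glyph")    -- the dieresis itself
  acc.font, acc.char = v.font, 127
  local k2 = node.new("kern")      -- restore the output position
  k2.kern = (wd_a - wd_acc) / 2

  node.insert_after(head, v, k1)
  node.insert_after(head, k1, acc)
  node.insert_after(head, acc, k2)

Vertical placement (e.g. over capitals) would additionally need the "yoffset" field; this bookkeeping is exactly why option (3) is usually easier.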
You mean if I insert additional nodes inbetween? If I only modify the nodes themselves without touching the list, this should not be a problem anyway.
right.
Shouldn't the subtype be "2"? In section 7.1.2.12, bit 1 is used to denote a glyph, if I understand the manual correctly. Since I convert the character to a glyph in the font, this bit should be set afterwards.
I am somewhat in doubt about that. The manual and my coding practice do not quite agree on what the value really should be, but bit zero should definitely be cleared (in fact, that is the only test made right now). Probably, the best approach will be to clear bit zero here (just before ligaturing), and set bit one *after* both ligaturing and kerning have happened. Comments and suggestions are welcome.
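A sketch of that approach, hooked into the kerning callback (this assumes node.kerning() as the built-in counterpart of node.ligaturing(); math.fmod() stands in for the modulo operator, whose percent sign TeX would otherwise eat inside the argument):

  \directlua0{
    callback.register('kerning',
      function (head, tail)
        node.kerning(head)
        % convert_to_glyphs already cleared bit zero before
        % ligaturing; now flag the survivors as processed glyphs
        for v in node.traverse_id(node.id('glyph'), head) do
          if math.fmod(v.subtype, 4) < 2 then
            v.subtype = v.subtype + 2
          end
        end
      end)
  }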
Also, in the manual the field is called "char", not "character". Which one is correct? (both?)
"char" is correct. Sorry, bad typing.
Anyway, now I have a place where to start. Still, I am a bit clueless on how to create an accented character node. Or do I have to insert another character node in the list containing the accent (i.e. '"' to create an 'ä' from an 'a')?
A virtual font would be easier, but: yes, you can do it that way.
How do I tell this new node to overlap the 'a'?
By adding some extra kerning around it, and/or tweaking the values of the "xoffset" and "yoffset" fields (the need for such node list manipulations is the main reason why a virtual font is easier).

Best wishes, Taco
Hello,
Shouldn't the subtype be "2"? In section 7.1.2.12, bit 1 is used to denote a glyph, if I understand the manual correctly. Since I convert the character to a glyph in the font, this bit should be set afterwards.
I am somewhat in doubt about that. The manual and my coding practice do not quite agree on what the value really should be, but bit zero should definitely be cleared (in fact, that is the only test made right now).
Mmm ... if the subtype is a bitfield, IMO one bit should be set; otherwise the subtype would be undefined. Still, why is the subtype a bitfield anyway? Can a glyph node, e.g., be both a character node as well as a ligature node?
[...]
Anyway, now I have a place where to start. Still, I am a bit clueless on how to create an accented character node. Or do I have to insert another character node in the list containing the accent (i.e. '"' to create an 'ä' from an 'a')?
A virtual font would be easier, but: yes, you can do it that way.
Let me see if I get this right:

1. I load the font, e.g. in OT1 encoding, just like any other TeX font.
2. I modify the font's table (e.g. by using the "define_font" callback): I add all characters I want to support to the font's "characters" array. Each of these new characters contains a "commands" field which constructs the character from several others in the font.
3. I use the (artificial, as described in section 6.2.1) font and am a happy clam.

Questions:

- Section 6 of the manual states that the key of the "characters" table is the "internal code TeX knows this character by". How do I determine this code? Is this simply the Unicode code point?
- If I handle accented characters this way, I do not have to create a "ligaturing" callback, do I?
- Does the "char" font command move the output pointer?
How do I tell this new node to overlap the 'a'?
By adding some extra kerning around it, and/or tweaking the values of the "xoffset" and "yoffset" fields (the need for such node list manipulations is the main reason why a virtual font is easier).
It seems like it, especially since I would need some kind of data structure to describe the character-replacements anyway. Also, a virtual font would most likely be easier on the garbage collector.
Jonathan
Jonathan Sauer wrote:
Hello,
Shouldn't the subtype be "2"? In section 7.1.2.12, bit 1 is used to denote a glyph, if I understand the manual correctly. Since I convert the character to a glyph in the font, this bit should be set afterwards.

I am somewhat in doubt about that. The manual and my coding practice do not quite agree on what the value really should be, but bit zero should definitely be cleared (in fact, that is the only test made right now).
Mmm ... if the subtype is a bitfield, IMO one bit should be set; otherwise the subtype would be undefined.
Still, why is the subtype a bitfield anyway? Can a glyph node, e.g., be both a character node as well as a ligature node?
a ligature is not really a character -)

anyhow, zero means 'nothing done', while other values < 256 mean something was done; you can use bits >= 256 for your own usage, since luatex only looks at the first 8 bits
Let me see if I get this right:
1. I load the font, e.g. in OT1 encoding, just like any other TeX font.
2. I modify the font's table (e.g. by using the "define_font" callback): I add all characters I want to support to the font's "characters" array. Each of these new characters contains a "commands" field which constructs the character from several others in the font.
3. I use the (artificial, as described in section 6.2.1) font and am a happy clam.
indeed; of course you can also make the virtual font independent of luatex
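Spelled out, a minimal (untested) skeleton of steps 1 and 2 in the "define_font" callback; font.read_tfm() and the "type"/"fonts" fields are as in the manual, the rest is illustrative:

  \directlua0{
    callback.register('define_font',
      function (name, size)
        local f = font.read_tfm(name, size)  % step 1: plain tfm loading
        if f then
          f.type  = 'virtual'
          f.fonts = { { name = name, size = size } }  % refer back to the tfm
          % step 2: add entries to f.characters here, each with
          % a 'commands' field (a concrete example follows below)
        end
        return f
      end)
  }

Step 3 is then an ordinary \font assignment; the callback intercepts the loading.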
Questions:
- Section 6 of the manual states that the key of the "characters" table is the "internal code TeX knows this character by". How do I determine this code? Is this simply the Unicode code point?
your choice, as long as you also provide the index, i.e. where to find the glyph in the font file
- If I handle accented characters this way, I do not have to create a "ligaturing" callback, do I?
no, this is controlled by the ligature subtable; if such a table is there (in the font that is) then things happen automatically (same for kerning)
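For reference, such a subtable sits inside the characters array and could look like this (plain Lua fragment; the slots are OT1 ones, and the kern amount is a made-up value in scaled points):

  -- 'f' (0x66) followed by another 'f' becomes the ff ligature,
  -- which OT1 keeps in slot 11; type 0 is a normal ligature
  f.characters[0x66].ligatures = {
    [0x66] = { char = 11, type = 0 },
  }
  -- kerning uses a parallel 'kerns' subtable:
  f.characters[0x41].kerns = { [0x56] = -27307 }  -- 'A' before 'V'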
- Does the "char" font command move the output pointer?
?
How do I tell this new node to overlap the 'a'?

By adding some extra kerning around it, and/or tweaking the values of the "xoffset" and "yoffset" fields (the need for such node list manipulations is the main reason why a virtual font is easier).
It seems like it, especially since I would need some kind of data structure to describe the character-replacements anyway. Also, a virtual font would most likely be easier on the garbage collector.
fonts are not garbage collected (well, at the lua end the table may be collected of course), but in tex itself a font is allocated permanently

Hans
Hans Hagen wrote:
Still, why is the subtype a bitfield anyway? Can a glyph node, e.g., be both a character node as well as a ligature node?
a ligature is not really a character -)
anyhow, zero means 'nothing done', while other values < 256 mean something was done; you can use bits >= 256 for your own usage, since luatex only looks at the first 8 bits
A glyph_node is either a pre-hyphenation character (only bit 0 is set, i.e. its value is 1), or it is something else. Some of the "something elses" really make more sense when subtype is a bit field (ligatures and ghosts). The advantage of using the first and second bits for "character" and "glyph" is that we could make tests like: is this a pre-ligkern node (bits 0 and 1 both unset), a post-ligkern glyph node (bit 1 set), or a character (bit 0 set)? Such a split allows speedups, as some processing steps can be skipped. Currently, however, only the character test is implemented.
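From the Lua end such tests could look like this (Lua 5.1 has no bit operators, so modulo arithmetic stands in):

  local st = v.subtype
  local is_character = (st % 2 == 1)  -- bit 0 set: untouched character
  local is_glyph     = (st % 4 >= 2)  -- bit 1 set: ligkern processing done
  local pre_ligkern  = (st % 4 == 0)  -- bits 0 and 1 both unset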
- If I handle accented characters this way, I do not have to create a "ligaturing" callback, do I?
correct.
- Does the "char" font command move the output pointer?
yes.

Best wishes, Taco
Hello,
[...]
Questions:
- Section 6 of the manual states that the key of the "characters" table is the "internal code TeX knows this character by". How do I determine this code? Is this simply the Unicode code point?
your choice, as long as you also provide the index, i.e. where to find the glyph in the font file
Now I'm back to being confused. How is this my choice? Let's say I have an input file with character "ä". Would its "internal code TeX knows this character by" be Unicode's "LATIN SMALL LETTER A WITH DIAERESIS" (00E4)? If yes: so in order to add this character to an OT1 font, would I simply add it to the "characters" table with key hex E4 (and an appropriate command table)?
- Does the "char" font command move the output pointer?
?
If I have a command table "char 23; char 42", will character number 42 be typeset on top of character 23, or to the right? (To the right, as Taco answered: "char" moves the output pointer.) Could this be added to the manual, to make the behaviour of the "char" font command clearer? (Some other commands explicitly state whether they move the output pointer.)
Hans
Jonathan
Jonathan Sauer wrote:
Hello,
[...]
Questions:
- Section 6 of the manual states that the key of the "characters" table is the "internal code TeX knows this character by". How do I determine this code? Is this simply the Unicode code point?
You do not have to determine that code, you are defining it by feeding a specific 'characters' array back into the define_font callback. If your text has to be hyphenated and your hyphenation patterns are true Unicode, then this had better be a Unicode code point. Proper Unicode patterns are our recommended practice, but characters-array deviations are useful for symbolic fonts and for language+font combinations that are hard to express in Unicode (patterns have to be UTF-8 encoded, but there is no strict requirement for Unicode compliance).
Now I'm back to being confused. How is this my choice? Let's say I have an input file with character "ä". Would its "internal code TeX knows this character by" be Unicode's "LATIN SMALL LETTER A WITH DIAERESIS" (00E4)? If yes: so in order to add this character to an OT1 font, would I simply add it to the "characters" table with key hex E4 (and an appropriate command table)?
Yes, spot on.
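To make that concrete, an untested sketch of such an entry, building on the virtual-font skeleton sketched earlier in the thread; 'a' sits in OT1 slot 0x61, the dieresis accent in slot 0x7F, and k centers the accent roughly the way \accent would:

  -- plain Lua fragment, to be run inside the define_font callback
  local a, acc = f.characters[0x61], f.characters[0x7F]
  local k = (a.width - acc.width) / 2
  f.characters[0x00E4] = {    -- the code point for U+00E4, "ä"
    width    = a.width,
    height   = acc.height,    -- the accent is the tallest part
    depth    = a.depth,
    commands = {
      { 'push' },          -- save the output pointer
      { 'right', k },      -- move to the accent position
      { 'char', 0x7F },    -- the accent; 'char' advances the pointer ...
      { 'pop' },           -- ... so restore it before the base letter
      { 'char', 0x61 },    -- typeset the 'a'
    },
  }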
If I have a command table "char 23; char 42", will character number 42 be typeset on top of character 23, or to the right? (To the right, as Taco answered: "char" moves the output pointer.) Could this be added to the manual, to make the behaviour of the "char" font command clearer? (Some other commands explicitly state whether they move the output pointer.)
I will be fixing the manual in the coming weeks. Your questions are helping a lot in that they make it clear where further explanations are needed.

Best wishes, Taco
participants (3): Hans Hagen, Jonathan Sauer, Taco Hoekwater