Hi,

I posted http://www.pragma-ade.com/temp/titus.pdf

now, one thing with unicode (utf) is that support needs to have an associated font / language switch. Traditionally, tex font mechanisms have been complicated by the fact that there are many shapes per font and math has to be dealt with.

If we're dealing with, say, sanskrit, is it then safe to assume that

(1) we can switch to the language (if not yet done) when we encounter a unicode character from the associated char/glyph range
(2) a relatively simple font mechanism is used (normal, bold, slanted)
(3) only a few (possibly derived from unicode) fonts are used, or at least one main type of font per language
(4) we can standardize on utf-8 [and assume some preprocessor if not]

[let's try to deal with the practical, so what's the practical usage]

Hans

-------------------------------------------------------------------------
Hans Hagen | PRAGMA ADE | pragma@wxs.nl
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------
information: http://www.pragma-ade.com/roadmap.pdf
documentation: http://www.pragma-ade.com/showcase.pdf
-------------------------------------------------------------------------
Hans,

I have looked at how emacs and Unicode browser deal with unicode and fonts. Unicode browser is an application on the CD-ROM that comes with the Unicode 3.0 book. They both use font sets, i.e., collections of fonts that are put together so as to cover a large part of the Unicode range. The Unicode browser scans the fonts in the order listed in its configuration file. When it finds a font that provides the sought character, it uses the glyph from that font. It is possible to refine the configuration: One can indicate that a font only contributes a certain range. One can exclude a range from a font. I believe this is a strategy that could be used by other applications.

For Context this might be worked out as follows: Each font family must be in a known encoding. When a font family is loaded, the encoding and the associated font family are added to a table of loaded encodings. When a unicode character is sought, the loaded encodings are scanned in the order in which they appear in the table, until an encoding is found that provides a glyph for that character.

It is possible that two font families are loaded that overlap in the range covered. Then the glyphs in the overlap area are taken from the font loaded first. This behaviour can be changed by configuring a font to contribute only a certain range of characters, or to exclude a certain range of characters from a font. This is a refinement that might be added later on.

The NFSS in LaTeX provides a default encoding for a character (not to be confused with Context's default encoding, which is a different thing). When the character is not found in the current encoding, it is taken from this default encoding. Such a strategy may be more efficient than going through the list of loaded encodings.

The above strategy may be efficient for a text that mainly consists of ascii characters. For a text that mainly consists of non-ascii characters, e.g. a chinese text, it requires much processing. Such a situation may be dealt with like encodings: When you are writing in a West European language, it is more efficient to use Latin-1 than utf-8. Similarly, when one is writing in chinese, a more efficient setup with a more limited coverage of characters may be used.

I prefer to use font families rather than fonts. This makes it easy to switch from one font family to another, while keeping constant the other font parameters such as shape and weight. I like the way this is done in LaTeX's NFSS. I do not (yet) know much about the way Context organizes its fonts.

One should be aware of the difference between character and glyph. Unicode is about characters, typesetters like TeX are about glyphs. It is very well possible that one font provides several variant glyphs for one and the same Unicode character. The user must have some way to express preference for one or the other.

I think the user should load the appropriate input regime, as he only knows the encoding of the input file. For XML files it is different; in DocbookInContext I will try to load the appropriate input regime automatically from the encoding mentioned in the xml declaration.

Configuring an appropriate font set is difficult. Perhaps font sets should be preconfigured, and fonts should be loaded as available. Good error messages when no font provides a glyph for a character in the text document should alert the user to missing fonts.

These are my thoughts.

Simon
-- Simon Pepping email: spepping@scaprea.hobby.nl
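A minimal plain-TeX sketch of the scanning strategy Simon describes above; this is not actual ConTeXt code, and all macro and font names are hypothetical. Ranges are registered in load order, and a code point is resolved to the first registered range that covers it:

\def\unifontranges{}

\def\defineunifontrange#1#2#3% first code point, last code point, font switch
  {\expandafter\def\expandafter\unifontranges\expandafter
     {\unifontranges\dounifontrange{#1}{#2}{#3}}}

\def\resolveunifont#1% decimal code point; sets \currentunifont
  {\let\currentunifont\relax
   \def\dounifontrange##1##2##3%
     {\ifx\currentunifont\relax % first match wins: the font loaded first
        \ifnum#1<##1 \else\ifnum#1>##2 \else\def\currentunifont{##3}\fi\fi
      \fi}%
   \unifontranges
   \ifx\currentunifont\relax\def\currentunifont{\rm}\fi}% last-resort fallback

% registration in load order, for example latin first, then a devanagari face
% (\devanagarifont is a made-up font switch):
\defineunifontrange{0}{591}{\rm}                 % basic latin and latin extended
\defineunifontrange{2304}{2431}{\devanagarifont} % U+0900..U+097F

% usage: \resolveunifont{2325}\currentunifont before typesetting U+0915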
At 09:38 PM 12/8/2002 +0100, you wrote:
> I have looked at how emacs and Unicode browser deal with unicode and fonts.
> Unicode browser is an application on the CD-ROM that comes with the Unicode
> 3.0 book. They both use font sets, i.e., collections of
so i have to buy that book -) what is the best place to get it?

> For Context this might be worked out as follows: Each font family must
> be in a known encoding. When a font family is loaded, the encoding and the
> associated font family are added to a table of loaded encodings. When a
> unicode character is sought, the loaded encodings are scanned in the order
> in which they appear in the table, until an encoding is found that provides
> a glyph for that character.
hm, must think this over, esp since tex has no way (except measuring) to determine if a slot is really taken
> It is possible that two font families are loaded that overlap in the range
> covered. Then the glyphs in the overlap area are taken from the font loaded
> first. This behaviour can be changed by configuring a font to contribute
> only a certain range of characters, or to exclude a certain range of
> characters from a font. This is a refinement that might be added later on.

> The NFSS in LaTeX provides a default encoding for a character (not to be
> confused with Context's default encoding, which is a different thing). When
> the character is not found in the current encoding, it is taken from this
> default encoding. Such a strategy may be more efficient than going through
> the list of loaded encodings.
eh ... context does have fallbacks (nearly always something default, often very plain); if something does not show up, it's probably not defined (yet); so, maybe i misunderstand you
> The above strategy may be efficient for a text that mainly consists of ascii
> characters. For a text that mainly consists of non-ascii characters, e.g. a
> chinese text, it requires much processing. Such a situation may be dealt
> with like encodings: When you are writing in a West European language, it is
> more efficient to use Latin-1 than utf-8. Similarly, when one is writing in
> chinese, a more efficient setup with a more limited coverage of characters
> may be used.
chinese is even more complicated: there can be mixed utf-like encodings, and chars need some kind of postprocessing (adding breakpoints and so on, or rotation in vertical typesetting, and/or special numbering things; this is already handled)
> I prefer to use font families rather than fonts. This makes it easy to
> switch from one font family to another, while keeping constant the other
> font parameters such as shape and weight. I like the way this is done in
> LaTeX's NFSS. I do not (yet) know much about the way Context organizes its
> fonts.
the organization is roughly the same as in any tex (a few axes); for scripts like chinese, names like SomeNiceFont automatically expand into SomeNiceFontBold at a certain size; this is a byproduct of using symbolic filenames; it also gives a pretty nice way of mixing latin, ideographic, and math scripts.
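As a rough illustration of that symbolic-name indirection: \definefontsynonym is the regular ConTeXt interface for mapping a symbolic name onto a file or another synonym; the file names below are made up, and the automatic SomeNiceFont to SomeNiceFontBold expansion builds on this kind of mapping.

\definefontsynonym [SomeNiceFont]     [snf-regular] % hypothetical file name
\definefontsynonym [SomeNiceFontBold] [snf-bold]    % hypothetical file name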
> One should be aware of the difference between character and glyph. Unicode
> is about characters, typesetters like TeX are about glyphs. It is very well
> possible that one font provides several variant glyphs for one and the same
> Unicode character. The user must have some way to express preference for one
> or the other.
i read somewhere that unicode is about scripts -) you're right; somehow we need to deal with the OpenType language-dependent glyphs; pretty nasty
> I think the user should load the appropriate input regime, as he only knows
> the encoding of the input file. For XML files it is different; in
> DocbookInContext I will try to load the appropriate input regime
> automatically from the encoding mentioned in the xml declaration.

> Configuring an appropriate font set is difficult. Perhaps font sets should
> be preconfigured, and fonts should be loaded as available. Good error
> messages when no font provides a glyph for a character in the text document
> should alert the user to missing fonts.
Indeed i think that we should have some reasonable defaults, and it seems that there are no free complete unicode fonts, so we probably end up with something like

  <range> => defaultfont

but maybe even with

  <subrange> => defaultfont

this needs some research. Thanks for your input.

Hans
On Mon, 09 Dec 2002 00:26:16 +0100, Hans wrote:
> At 09:38 PM 12/8/2002 +0100, you wrote:
>> I have looked at how emacs and Unicode browser deal with unicode and fonts.
>> Unicode browser is an application on the CD-ROM that comes with the Unicode
>> 3.0 book. They both use font sets, i.e., collections of
> so i have to buy that book -) what is the best place to get it?
www.unicode.org

--
greetings, Taco
Monday, December 9, 2002 Hans Hagen wrote:

HH> hm, must think this over, esp since tex has no way (except measuring) to
HH> determine if a slot is really taken

e-TeX can, IIRC. And since UTF support requires e-TeX anyway ...

--
Giuseppe "Oblomov" Bilotta
At 11:40 AM 12/9/2002 +0100, you wrote:
> Monday, December 9, 2002 Hans Hagen wrote:
> HH> hm, must think this over, esp since tex has no way (except measuring) to
> HH> determine if a slot is really taken
> e-TeX can, IIRC. And since UTF support requires e-TeX anyway ...
sure, but that still leaves the check-whether-the-file-exists problem, although a way out is to add the tfm paths to the tex search paths

Hans
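For reference, the e-TeX test Giuseppe presumably means is \iffontchar: it reports whether a slot in an already loaded font really holds a glyph, though it says nothing about whether a tfm file exists on disk, which is the remaining problem Hans mentions. A minimal sketch (the font and slot are only examples):

\font\testfont=cmr10 % any loaded tfm will do
\iffontchar\testfont 91
  slot 91 of \fontname\testfont\space is taken
\else
  slot 91 of \fontname\testfont\space is empty
\fi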
On Mon, Dec 09, 2002 at 12:26:16AM +0100, Hans Hagen wrote:
> At 09:38 PM 12/8/2002 +0100, you wrote:
>> For Context this might be worked out as follows: Each font family must
>> be in a known encoding. When a font family is loaded, the encoding and the
>> associated font family are added to a table of loaded encodings. When a
>> unicode character is sought, the loaded encodings are scanned in the order
>> in which they appear in the table, until an encoding is found that provides
>> a glyph for that character.
> hm, must think this over, esp since tex has no way (except measuring) to
> determine if a slot is really taken
My idea was that the encoding should indicate which slots are provided (if the font complies).
>> The NFSS in LaTeX provides a default encoding for a character (not to be
>> confused with Context's default encoding, which is a different thing). When
>> the character is not found in the current encoding, it is taken from this
>> default encoding. Such a strategy may be more efficient than going through
>> the list of loaded encodings.
> eh ... context does have fallbacks (nearly always something default, often
> very plain); if something does not show up, it's probably not defined (yet);
> so, maybe i misunderstand you
I do not see this as a fallback but as an optimization. It is an effective means of knowing which encoding is on top for a certain character.
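The LaTeX 2e commands behind this per-character default are \DeclareTextSymbol and \DeclareTextSymbolDefault; roughly as follows (the slot number is only illustrative):

\DeclareTextSymbol{\texteuro}{TS1}{191}   % \texteuro lives at some slot of TS1
\DeclareTextSymbolDefault{\texteuro}{TS1} % when the current encoding has no
                                          % \texteuro, NFSS takes it from TS1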
> Indeed i think that we should have some reasonable defaults, and it seems
> that there are no free complete unicode fonts, so we probably end up with
> something
There are apps, e.g. XMLSpy, that rely on a single font to provide all required characters. I find that a waste of resources; the user's fonts are used much better if they can be combined into a set.

Simon
On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
> Hi,
> I posted
U+0E5B = \char14:91
\startunicodevector 34

Can these numbers also be given in hexadecimal, e.g., \char "E:"5B? Unicode data sheets and font layout tables are usually given in hexadecimal. I find myself converting from hex to decimal and back; it would be easier to remain in hex.

Simon
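For reference, the arithmetic behind that notation is a div/mod by 256 of the code point; a worked example for U+0E5B, written as LaTeX:

\[
  \mathtt{0E5B}_{16} = 3675_{10}, \qquad
  \lfloor 3675/256 \rfloor = 14 = \mathtt{0E}_{16}, \qquad
  3675 \bmod 256 = 91 = \mathtt{5B}_{16}
\]

so the decimal pair 14:91 and the hexadecimal pair "E:"5B denote the same character.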
At 09:44 PM 12/9/2002 +0100, you wrote:
> On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
>> Hi,
>> I posted
> U+0E5B = \char14:91
> \startunicodevector 34
> Can these numbers also be given in hexadecimal, e.g., \char "E:"5B? Unicode
> data sheets and font layout tables are usually given in hexadecimal. I find
> myself converting from hex to decimal and back; it would be easier to remain
> in hex.
in unic-ini (at the end) change:

\chardef\utfunicommandmode=0 % 1 = hex

\def\unicodecommandchar#1#2%
  {\string\char
   \ifcase\utfunicommandmode
     #1:#2\else\lchexnumbers#1:\lchexnumbers#2%
   \fi}

\def\utfunifontcommand#1%
  {\xdef\unidiv{\number\utfdiv{#1}}%
   \xdef\unimod{\number\utfmod{#1}}%
   \ifnum#1<\utf@i
     \unicodecommandchar\unidiv\unimod
   \else\ifcsname\@@univector\unidiv\endcsname
     \@EA\string\csname\doutfunihash{\unidiv}{#1}\endcsname
   \else
     \unicodecommandchar\unidiv\unimod
   \fi\fi}
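With that change in place, the hexadecimal form is presumably selected via the mode flag defined at the top of the patch (assumed usage, following its "% 1 = hex" comment):

\chardef\utfunicommandmode=1 % emit the div:mod pair in hexadecimal rather than decimal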
participants (4)

- Giuseppe Bilotta
- Hans Hagen
- Simon Pepping
- Taco Hoekwater