Hello all,

This is my first message to the list. I've been using ConTeXt for a few months now, and so far it does everything I want it to do, plus much, much more!

Recently I made the 'unwise' decision to start studying Japanese next year, and of course I want to keep using ConTeXt to write my school papers. I am already able to create Japanese documents using a terrific Japanese TeX distribution (w32tex) and pLaTeX, but everyone on this mailing list knows that LaTeX is kinda 'weird' (to put it mildly) when you are used to the beauty that is ConTeXt! :-) So I decided to find a way to write Japanese in ConTeXt.

First I tried the eOmega/ConTeXt combination, since I have some great OTPs for it, but I soon found out that Omega is still "the TeX of the future", in other words not the "TeX of today", and extremely unstable.

Then I decided to try ConTeXt's UTF-8 support. I created the following test file:

--------------
\chardef\utfunihashmode=1

\setupunicodefont [japanese] [scale=1.0]
\definefontsynonym [JapaneseMinchoRegular][cyberb]
\defineunicodefont [Mincho][JapaneseMincho][japanese]
\Mincho

\enableregime[utf]

\starttext
... <Imagine a bunch of UTF-8 encoded Japanese characters here> ...
\stoptext
--------------

cyberb is the Unicode font cyberbit.ttf, which I installed using ttf2tfm:

  ttf2tfm cyberbit.ttf cyberb@Unicode@

For output I use dvipdfmx, with the following line added to the map file:

  cyberb@Unicode@ Identity-H :0:cyberbit.ttf

Well, to my big surprise, it worked! I saw the characters without a problem. Using the 'scale' option of \setupunicodefont I could also change the size of the characters. Great!

But since there are usually no spaces in a Japanese sentence, there is no line breaking. And as you can imagine, line breaking is a useful feature to have! :-)

I imagined that the same line breaking problem occurs when someone wants to write Chinese, so I decided to take a look at ConTeXt's Chinese module to see how it is handled there.
I saw that the Chinese module adds an interglyph space after a character, which is breakable by TeX. This happens in a macro that is (indirectly) called via \setupunicodefont and the 'command' option. I decided to try the same in my test file.

But first, I checked whether using the 'command' option in \setupunicodefont actually worked. I added the following macro:

  \def\HandleJapaneseGlyph
    {\insertunicodeglyph}

And changed my \setupunicodefont into:

  \setupunicodefont
    [japanese]
    [scale=1.0,
     command=\HandleJapaneseGlyph]

Well, I still get Japanese characters like normal. I imagined that if I removed \insertunicodeglyph from my macro, I wouldn't see them anymore, but this is not the case. I found out that I can do anything in my macro, and it has no effect on the Japanese characters: they still get printed. I can even use command=\whateveryoulike, and ConTeXt won't complain that such a macro doesn't exist. I get the feeling that the command option is completely ignored. Apparently, my idea isn't going to work. :-(

To make a long story even longer: I would like to know why it doesn't work, or what I should do to make it work. What is the correct way to divert the Unicode character output to another macro, so that I can add a breakable space after each character?

Well, I've been using ConTeXt for only a few months, so maybe the complexity of this is way over my head. At least it kept me busy! On the other hand, I don't think writing Japanese is much different from writing Chinese, so it must be possible without too much trouble or reinventing the wheel.

Thanks for listening,
Tim
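P.S. For the record, this is the kind of handler I was hoping the 'command' option would let me install. It is only a sketch: the penalty and glue values are guesses of mine, and since the option seems to be ignored, it does not actually do anything yet:

```tex
% sketch: typeset each Unicode glyph, then add breakable
% interglyph glue so that TeX can break Japanese lines
\def\HandleJapaneseGlyph
  {\insertunicodeglyph          % the glyph itself
   \penalty0                    % allow a line break here
   \hskip0pt plus .1em}         % a little stretchable space

\setupunicodefont
  [japanese]
  [scale=1.0,
   command=\HandleJapaneseGlyph]
```

If this worked, every glyph would be followed by an invisible, stretchable, breakable space, which as far as I can tell is essentially what the Chinese module does.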
Tim 't Hart wrote:
Recently, I've made the 'unwise' decision to start studying Japanese next year, and of course I want to keep on using ConTeXt to write my school papers. [....] So I decided to find a way to write Japanese in ConTeXt.
First I tried using the eOmega/ConTeXt combination since I have some great OTPs for it, but soon found out that Omega is still "the TeX of the future", in other words, not the "TeX of today" and extremely unstable.
Then I decided to try ConTeXt's UTF-8 support. I created the following test
I asked about Japanese a while back. Hans requested more information on encodings, fonts, etc. I don't know enough about these things, or about ConTeXt, to know what is needed exactly.

From what I've read, Unicode is not that popular in Japan itself. The most common encodings here are:

a) iso-2022-jp (7bit)
b) japanese-iso-8bit (a.k.a. euc-japan-1990, euc-japan, euc-jp)
c) japanese-shift-jis (Shift-JIS, 8bit; common under MS Windows)

"Describe Language Environment" under MULE in GNU Emacs gives some info. Ken Lunde of Adobe has a book or two on processing Japanese.

Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems:

a) Kanji (Chinese characters)
b) Hiragana (syllabic script for representing grammatical endings and words for which Kanji are not commonly used)
c) Katakana (syllabic script for representing foreign words, some scientific words (flora, fauna), and for emphasis)
d) Romaji, lit. "Roman characters" (foreign languages, especially English, are sometimes represented in Latin script)

It is more common than you might imagine. I guess I need to track down a few sample documents. I tried to turn up some info on Japanese typesetting rules but had no luck.

best wishes,

Matt
Matthew Huggett wrote:
I asked about Japanese a while back. Hans requested more information on encodings, fonts, etc. I don't know enough about these things or ConTeXt to know what is needed exactly.
From what I've read, unicode is not that popular in Japan itself. ...
Unicode wasn't that popular because Unix-like operating systems used EUC as encoding, and Microsoft used their own invented Shift-JIS encoding. So there is still a lot of digital text out there written in these encodings, and a lot of tools still use them. But I think that if you want to write new texts, using Unicode shouldn't be a problem for most users. I guess that most editors supporting Asian encodings also make it possible to save as UTF-8. I think nowadays it's easier to find a Unicode-enabled editor than it is to find a Shift-JIS/EUC editor! (Well, on Windows anyway...) Since ConTeXt already supports UTF-8, I don't see a reason to make things more difficult than they already are by writing text in other encodings.

When I look at the source of the Chinese module, the most difficult part for me to understand is the part about font encoding: the enco-chi.tex file, and the use of \defineuclass in that file. I guess it has something to do with mapping the written text to the font. If I understand correctly, the Chinese module doesn't use Unicode fonts, but GBK or Big5 encoded fonts. I guess that if you want to make a proper Japanese module, you'll need to support JIS or Shift-JIS encoded fonts.

But on the other hand, maybe we don't need to support that, since there are a lot of Japanese Unicode fonts available. I use WinXP, and there we have msmincho.ttc and msgothic.ttc, which are both Unicode fonts. I also use kochi-mincho.ttf and kochi-gothic.ttf, which are both freely available Japanese Unicode fonts. And Cyberbit is a Unicode font as well. Commercially available fonts by DynaLab (the DynaFont Japanese TrueType collection is quite cheap and very good) are also Unicode fonts. Again, I don't think we should make it difficult for ourselves by trying to support non-Unicode fonts while Unicode Japanese fonts are easy to use and widely available.
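For example, hooking one of the free Kochi fonts into the test setup I posted earlier should just be a matter of swapping the font synonym. Untested, and 'kochim' is only a TFM name I made up here:

```tex
% sketch: use Kochi Mincho instead of Cyberbit in my earlier test,
% after generating the TFM with:  ttf2tfm kochi-mincho.ttf kochim@Unicode@
% and adding a dvipdfmx map line: kochim@Unicode@ Identity-H :0:kochi-mincho.ttf
\definefontsynonym [JapaneseMinchoRegular][kochim]
\defineunicodefont [Mincho][JapaneseMincho][japanese]
\Mincho
```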
Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems
The fact that Japanese uses four writing systems is not really a problem. Hiragana and Katakana (Kana) simply occupy different Unicode ranges than Kanji/Chinese. Things might get difficult if you want to use a different font for Kana than for Kanji; then you need to assign a different font to a different Unicode range. But I have no idea why somebody would want to do such a thing! Just using Unicode and a Japanese Unicode font takes care of things.

If you type Romaji/Latin characters in the example I posted yesterday, they get printed in CMR. I did some tests, and I could change this to any other font I wanted, just by using the normal ConTeXt font mechanisms. So I guess it is easy to mix Japanese fonts with normal Latin fonts.
I guess I need to track down a few sample documents. I tried to turn up some info on Japanese typesetting rules but had no luck.
The only info I have is from Ken Lunde's CJKV book, where he mentions some rules about CJK line breaking. Also, some characters are allowed to protrude into the right margin. I have some OTPs for Omega which handle all of this. They can be seen here:

http://www.math.jussieu.fr/~zoonek/LaTeX/Omega-Japanese/doc.html

At first I wanted to use Omega with ConTeXt so that I could use these OTPs, but Omega isn't really stable. With the ConTeXt example that I posted yesterday, I am already able to write Japanese in UTF-8, use a Unicode Japanese font in ConTeXt, and get Japanese output. I hope the hard part is already behind me! :-) The only thing that still puzzles me is how I can add interglyph space so that TeX can break the lines. If someone can help, I would really appreciate it!

My best,
Tim
On Mon, Jun 09, 2003 at 11:16:27PM +0900, Matthew Huggett wrote:
Recently, I've made the 'unwise' decision to start studying Japanese next year,
Unwise? Only if you don't really want to do it, or if you are laboring under illusions--left over from the 80s--that it will guarantee you a lucrative and glamorous career in international trade ;-) But anyway, I am also interested in using ConTeXt for Japanese, and would be glad to contribute what I can to this effort.
I asked about Japanese a while back. Hans requested more information on encodings, fonts, etc. I don't know enough about these things or ConTeXt to know what is needed exactly.
I don't know much about ConTeXt internals, but do know something about "these things," so I may be able to help. Was Hans' request on the mailing list? If you know when it was posted, perhaps I can look it up.
Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems:
On Mon, Jun 09, 2003 at 06:33:49PM +0200, Tim 't Hart wrote:
Unicode wasn't that popular because Unix-like operating systems used EUC as encoding, and Microsoft used their own invented Shift-JIS encoding.
There were also cultural/political reasons, with perhaps a touch of Not Invented Here syndrome. But that's a different story.
So there is still a lot of digital text out there written in these encodings, and a lot of tools still use them. But I think that if you want to write new texts, using Unicode shouldn't be a problem for most users. I guess that most editors supporting Asian encodings also make it possible to save as UTF-8. I think nowadays it's easier to find a Unicode-enabled editor than it is to find a Shift-JIS/EUC editor! (Well, on Windows anyway...)
Yes, recent Windows versions (starting with NT 4.0 in the business series, and ... not sure ... ME? in the consumer series) use some form of Unicode as their base encoding, so I think it is now the norm for Windows text editors to support UTF-8 ... I'm pretty sure TextPad does, for example.
Since ConTeXt already supports UTF-8, I don't see a reason to make things more difficult than they already are by writing text in other encodings.
On the face of it that makes sense. But I don't think it's safe to make a blanket assumption that the text in a ConTeXt document will originate with the creator of the document, or that it will be newly written. Also, UTF-8 support is still a bit half-baked on Unix/Linux systems.
When I look at the source of the Chinese module, the most difficult part for me to understand is the part about font encoding, the enco-chi.tex file, and the use of \defineuclass in that file. I guess it has to do something with mapping the written text to the font.
Most likely. I might be able to glean something useful from that file. I'll take a look when I can find the time.
I guess that if you want to make a proper Japanese module, you'll need to support JIS or Shift-JIS encoded fonts.
This would be a good idea for Type 1 font support. It seems to me that almost all recent Japanese TrueType fonts have a Unicode CMap.
But on the other hand, maybe we don't need to support that since there are a lot of Japanese Unicode fonts available. I use WinXP, and there we have msmincho.ttc and msgothic.ttc, which are both Unicode fonts.
Can PDFTeX handle TTC files? I know ttf2afm/ttf2pk can process them, but I have tried two or three times to include a Japanese TTC font directly in a PDFTeX document and was never able to make it work.
And Cyberbit is a Unicoded font as well. Commercially available fonts by Dynalab (Dynafont Japanese TrueType collection is quite cheap and very good) are also Unicode fonts. Again, I don't think we should make it difficult for ourselves by trying to support non-Unicode fonts while unicoded Japanese fonts are easy to use and widely available.
Well, it can be done in stages. I think that any serious attempt to support Japanese in ConTeXt should encompass all common encodings. But I don't see anything wrong with starting out Unicode-only.
Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems
The fact that Japanese uses four writing systems is not really a problem.
Maybe it's not a big problem. But it is certainly more complex than Chinese, since there is a mixture of proportional and fixed-width characters, and the presence of Kana and Romaji complicates the line-breaking rules.
I guess I need to track down a few sample documents. I tried to turn up some info on Japanese typesetting rules but had no luck.
What would a good sample consist of? I can probably find something.
The only info I got is from Ken Lunde's CJKV book, where he mentions some rules about CJK line breaking.
Yes, Lunde is good, but he doesn't go into enough detail to serve as an implementor's guide. I've also searched for more info on this subject; my impression is that besides Lunde's books there is really nothing available in English. I could probably make some sense out of the Japanese works that are available, but it would take up much more time than I have.
With the ConTeXt example that I posted yesterday, I am already able to write Japanese in UTF-8, use a Unicoded Japanese font in ConTeXt, and get Japanese output. I hope the hard part is already behind me! :-) The only thing that still puzzles me is how I can add interglyph space so that TeX can break the lines. If someone can help, I would really appreciate it!
Sorry, no idea. But it sounds like you've made an admirable effort so far. I was working along similar lines a couple of years ago, but was never able to produce anything useful. Guess you're a better TeXnician than I.

--
Matt Gushee                     When a nation follows the Way,
Englewood, Colorado, USA        Horses bear manure through
mgushee@havenrock.com              its fields;
http://www.havenrock.com/       When a nation ignores the Way,
                                Horses bear soldiers through
                                   its streets.
                                --Lao Tzu (Peter Merel, trans.)
Matt Gushee wrote:
What would a good sample consist of? I can probably find something.
Well, for starters I guess samples showing the interaction of the four writing scripts (I'm thinking of glyph spacing and line-breaking here; e.g., in the transition from native script to Romaji and back again). Do you know much about different heading styles? I suppose they are similar to the Chinese ones depending on how traditional the text is; i.e., kanji or Arabic numerals, the presence of a "section" kanji before the numbering, etc. Examples of Furigana would be good. Matt Huggett
At 17:24 09/06/2003 -0600, Matt Gushee wrote:
Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems:
dunno, could also be a challenge; as long as tagging is done properly i see no real problem there
On Mon, Jun 09, 2003 at 06:33:49PM +0200, Tim 't Hart wrote:
Unicode wasn't that popular because Unix-like operating systems used EUC as encoding, and Microsoft used their own invented Shift-JIS encoding.
There were also cultural/political reasons, with perhaps a touch of Not Invented Here syndrome. But that's a different story.
same as in china: many encodings alongside unicode
Since ConTeXt already supports UTF-8, I don't see a reason to make things more difficult than they already are by writing text in other encodings.
On the face of it that makes sense. But I don't think it's safe to make a blanket assumption that the text in a ConTeXt document will originate with the creator of the document, or that it will be newly written. Also, UTF-8 support is still a bit half-baked on Unix/Linux systems.
i'm sure that wang lei (on this list) can help you out; if i'm right he is aware of japanese font demands
I guess that if you want to make a proper Japanese module, you'll need to support JIS or Shift-JIS encoded fonts.
This would be a good idea for Type 1 font support. It seems to me that almost all recent Japanese TrueType fonts have a Unicode CMap.
one of the first things to do is to collect fonts in suitable encodings and post them somewhere (or at least post scripts that generate them)
Can PDFTeX handle TTC files? I know ttf2afm/ttf2pk can process them, but I have tried 2 or 3 times to include a Japanese TTC font directly in a PDFTeX document, but was never able to make it work.
dunno, maybe dvipdfmx can
Well, it can be done in stages. I think that any serious attempt to support Japanese in ConTeXt should encompass all common encodings. But I don't see anything wrong with starting out Unicode-only.
in that case some range mapping should be defined; proper test files, etc
Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems
The fact that Japanese uses four writing systems is not really a problem.
Maybe it's not a big problem. But it is certainly more complex than Chinese, since there is a mixture of proportional and fixed-width characters, and the presence of Kana and Romaji complicates the line-breaking rules.
hm, but as long as the rules are clear, things should be configurable as much as possible
The only info I got is from Ken Lunde's CJKV book, where he mentions some rules about CJK line breaking.
Yes, Lunde is good, but he doesn't go into enough detail to serve as an implementor's guide. I've also searched for more info on this subject;
right, many nice tables and glyphs -)
my impression is that besides Lunde's books there is really nothing available in English. I could probably make some sense out of the Japanese works that are available, but it would take up much more time than I have.
then ... write it down in a document/manual and make that the test case for context; if the manual can be processed we're done!

Hans

-------------------------------------------------------------------------
Hans Hagen | PRAGMA ADE | pragma@wxs.nl
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------
information: http://www.pragma-ade.com/roadmap.pdf
documentation: http://www.pragma-ade.com/showcase.pdf
-------------------------------------------------------------------------
Hello Hans and Matt,
Can PDFTeX handle TTC files? I know ttf2afm/ttf2pk can process them, but I have tried 2 or 3 times to include a Japanese TTC font directly in a PDFTeX document, but was never able to make it work.
dunno, maybe dvipdfmx can
I don't think PDFTeX can use TTC fonts. I use PDFTeX for DVI output and use dvipdfmx for PDF. Map files for dvipdfmx support fonts inside a TrueType Collection. TTF2TFM also supports the extra fonts inside a TTC by using the -f switch. For example, msmincho.ttc contains MS-Mincho and MS-PMincho:

  ttf2tfm msmincho.ttc msmin@Unicode@          (will make TFM for MS-Mincho)
  ttf2tfm msmincho.ttc -f 1 mspmin@Unicode@    (will make TFM for MS-PMincho)

The map file for dvipdfmx will then look like:

  msmin@Unicode@  Identity-H :0:msmincho.ttc   (for MS-Mincho)
  mspmin@Unicode@ Identity-H :1:msmincho.ttc   (for MS-PMincho)
Well, it can be done in stages. I think that any serious attempt to support Japanese in ConTeXt should encompass all common encodings. But I don't see anything wrong with starting out Unicode-only.
in that case some range mapping should be defined; proper test files, etc
Right now I'm working on a home page which contains information about where to find Japanese fonts and how to install them for ConTeXt/dvipdfmx. I will also add some example files of what is already possible in ConTeXt. I'll post the URL soon. My best, Tim
At 18:33 09/06/2003 +0200, Tim 't Hart wrote:
When I look at the source of the Chinese module, the most difficult part for me to understand is the part about font encoding, the enco-chi.tex file, and the use of \defineuclass in that file. I guess it has to do something with mapping the written text to the font. If I understand correctly, the Chinese module doesn't use Unicode fonts, but GBK or Big5 encoded fonts.
indeed, there is quite some remapping going on there, (one can hook in new ones if needed); a complication is that the mapping may change per font (simplified or not)
get printed in CMR. I did some tests and I could change the font in any other font I wanted to, just by using the normal ConTeXt font mechanisms. So I guess it is easy to mix Japanese fonts with normal Latin fonts.
the cmr comes from the main font handler so if you choose times or palatino it would come out that way; in chinese font switching is triggered by glyphs
With the ConTeXt example that I posted yesterday, I am already able to write Japanese in UTF-8, use a Unicoded Japanese font in ConTeXt, and get Japanese output. I hope the hard part is already behind me! :-) The only thing that still puzzles me is how I can add interglyph space so that TeX can break the lines. If someone can help, I would really appreciate it!
for that i need to have samples and fonts,

Hans
Hello Hans, You wrote:
one of the first things to do is to collect fonts in suitable encodings and
post them somewhere (or at least post scripts that generate them)
And
for that i need to have samples and fonts,
I created a simple home page that will tell you where you can find some good Japanese (Unicode) fonts, and how I installed them so that they can be used with ConTeXt and dvipdfmx. The URL is:

http://context.t-hart.com/

I have also posted some ConTeXt source files which will show you what I can do with Japanese fonts and ConTeXt right now. The PDF output files are downloadable as well, so you can see how everything should look. Remember that I have only used ConTeXt for a few months, so please don't have a heart attack when you see my flashy ConTeXt coding! ;-)

There is also a small list of things to do or to keep in mind when making simple Japanese support. If someone has any more ideas, let me know. IMHO, we need to concentrate on supporting Unicode fonts first, and if that works, we can look at other font encodings. The most important feature to have right now is simple line breaking.

Hans, please tell me what I can do to help implement Japanese support in ConTeXt, or what more information you need to get a better overview of the things that need to be done. I don't know much about ConTeXt yet, but I promise to do my best.

My best,
Tim
Hans, please tell me what I can do to help implement Japanese support in ConTeXt, or what more information you need to get a better overview of the things that need to be done. I don't know much about ConTeXt yet, but I promise to do my best.
My best, Tim
If you need any help with documentation (writing, proof-reading, etc.) let me know. Matt H.
At 13:48 08/06/2003 +0200, Tim 't Hart wrote:
Then I decided to try ConTeXt's UTF-8 support. I created the following test file:
..... you mix up two mechanisms:

(1) the one used for chinese is not utf but an installable multi glyph mechanism, where the first glyph triggers a font and the second a char

(2) utf encodings directly map onto a font (needed to get hyphenation right)

so what you need is either a dedicated handler like chinese, or a plug-in into the utf handler.
But since there are usually no spaces in a Japanese sentence, there is no line breaking. And as you can imagine, line breaking is a useful feature to have! :-)
A few questions:

- How are the rules for breaking?
- how many glyphs are there (well, i could look it up in the big cjk book)
- what ranges do we use? (see unic-* files for utf handling)

Can you make a small test suite?

Hans
On Sun, Jun 15, 2003 at 11:03:06PM +0200, Hans Hagen wrote:
A few questions;
- How are the rules for breaking?
For a detailed explanation, you should refer to the big book. But actually the rules are not all that difficult--probably a good deal simpler than European languages, I'd say. The most important thing to know is that there is a certain set of characters that may not occur at the end of a line, and another set that may not occur at the beginning, and I believe (it's been a while since I seriously looked at any of this) that there are certain unbreakable pairs, but not a huge number of them.
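In TeX terms, both rules come down to where an infinite penalty is placed; a sketch (the macro names are mine, and the character examples are only illustrative):

```tex
% sketch: the two kinsoku rules as TeX penalties
% a character that may not BEGIN a line (e.g. closing punctuation,
% small kana) gets an unbreakable penalty inserted BEFORE it:
\def\nobreakbefore{\penalty10000 }
% a character that may not END a line (e.g. opening brackets)
% gets an unbreakable penalty inserted AFTER it:
\def\nobreakafter {\penalty10000 }
```

Everywhere else, ordinary breakable interglyph glue would be emitted.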
- how many glyphs are there (well, i could look it up in the big cjk book)
That's rather a tricky question, and the answer depends partly on whether you want a complete solution or an 80/20 one. You probably know that there are two main character sets in Japanese: jis-x-0208 and jis-x-0212 (of course, the full names are suffixed with years, but I forget what the current versions are). The vast majority of all Japanese text (notice I said text, *not* documents) can be written with hiragana and katakana (50+ characters each), the roman alphabet (256, I guess?), and the kanji in jis-x-0208, of which there are about 6000.

However, it's hard to get away without using jis-x-0212. Literary terms and probably some specialized scientific vocabulary often require it, and most critically, geographic and personal names very often use jis-x-0212 characters. It's common to find a name whose characters have close equivalents in jis-x-0208, but where that particular name must be written with a variant glyph that exists only in jis-x-0212. In Japanese culture it is unacceptable to substitute glyphs in names. An analogy in Western languages might be: suppose you had a typesetting system that was incapable of rendering the string "sen" at the end of a word. Then, whenever you encountered the names Andersen or Olsen, you would print them as "Anderson" and "Olson." I don't think anyone would consider that acceptable.

So the upshot of this is that, though jis-x-0212 glyphs make up a very small proportion of the Japanese text that is printed (I'd guess 1-2 percent), a large proportion of documents (40-50 percent, maybe) require one or more glyphs from that set. So that's another 8000 glyphs, if you want to do it right.

One other point that may or may not matter is that ... I'm not sure if this is the correct terminology, but the code points of the Japanese character sets are arranged in a sparse matrix (?). Each plane is 94x94, rather than 256x256. I used to know why.
At 16:22 15/06/2003 -0600, you wrote:
For a detailed explanation, you should refer to the big book. But actually the rules are not all that difficult--probably a good deal simpler than European languages, I'd say. The most important thing to know is that there is a certain set of characters that may not occur at the end of a line, and another set that may not occur at the beginning, and I believe (it's been a while since I seriously looked at any of this) that there are certain unbreakable pairs, but not a huge number of them.
ok, so that's like chinese; now, how about numbering [chinese has multiple systems] and labels [chinese has pre/post labels]?
- how many glyphs are there (well, i could look it up in the big cjk book)
nice explanation, i think you all should team up in making a nice manual about this!

Hans
Hans Hagen wrote:
you mix up two mechanisms:
Yes, after studying the Chinese module for a while, I also came to the conclusion that I mixed things up badly! :-)

So instead of enjoying the nice weather during the weekend, I wrote some mapping files that will create subfonts for the EUC-JP encoding. Each subfont contains glyphs with the same first byte, just like the idea behind the Chinese module. Then I wrote a basic 'font-jpn.tex' file, and now I can write Japanese in EUC-JP encoding, including basic line breaking!

I was still working on this and wanted to release it when it was more useful, but I guess I have to speed things up now. Also, since I'm not an expert in ConTeXt, I'm sure I'm doing some things completely the wrong way, so I think it's good if someone else takes a look at it. There is a lot to improve! :-)
- How are the rules for breaking?
The rules are basically the same as in Chinese. Japanese also contains smaller versions of the kana (hiragana and katakana) glyphs, and breaking before those is not allowed either. Also, there seem to be different classes of breaking: for some characters breaking is strictly forbidden, and for some it is only slightly forbidden. (I guess this means that you should not break at slightly forbidden characters, but if the penalty is too bad, you may break there anyway.)
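In TeX terms, I imagine the classes translate into penalties of different strengths between glyphs, something like this (the macro names and values are only my guesses, not what my font-jpn.tex currently does):

```tex
% sketch: three interglyph break classes for Japanese
\def\jpnobreak  {\penalty10000 }        % break strictly forbidden
\def\jpsoftbreak{\penalty5000           % break discouraged
                 \hskip0pt plus .05em}
\def\jpbreak    {\penalty0              % break freely allowed
                 \hskip0pt plus .1em}
```

A glyph handler would then emit \jpnobreak before small kana and closing punctuation, and \jpbreak between unrestricted kanji.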
Can you make a small test suite?
Yes, I am at work right now, but when I get back, I'll send you the mapping files to make the fonts, the font-jpn file I was working on, and some other things like sample files and line breaking rules. My best, Tim
At 06:37 16/06/2003 +0200, you wrote:
Hans Hagen wrote:
you mix up two mechanisms:
Yes, after studying the Chinese module for a while, I also came to the conclusion that I mixed up bad! :-)
So instead of enjoying the nice weather during the weekend, I wrote some mapping files that will create subfonts for EUC-JP encoding. Each subfont contains glyphs with the same first byte, just like the idea behind the Chinese module.
Then I wrote a basic 'font-jpn.tex' file and now I can write Japanese in EUC-JP encoding, including basic line breaking!
I was still working on this and wanted to release it when it was more useful, but I guess I have to speed things up now. Also, since I'm not an expert in ConText, I'm sure I'm doing some things completely the wrong way, so I think it's good if someone else will take a look at it. There is a lot to improve! :-)
wang lei (chinese) and chof (korean) are experts in that area

Hans
participants (4):

- Hans Hagen
- Matt Gushee
- Matthew Huggett
- Tim 't Hart