On Sun, Jun 15, 2003 at 11:03:06PM +0200, Hans Hagen wrote:
> A few questions;
> - How are the rules for breaking?
For a detailed explanation, you should refer to the big book. But actually the rules are not all that difficult--probably a good deal simpler than those for European languages, I'd say. The most important thing to know is that there is a certain set of characters that may not occur at the end of a line, and another set that may not occur at the beginning. I also believe (it's been a while since I seriously looked at any of this) that there are certain unbreakable pairs, but not a huge number of them.
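To make the two prohibited sets concrete, here is a minimal sketch in Python. The character sets are small illustrative subsets that I picked for the example, not the full tables from any standard, and the function names are hypothetical. Unbreakable pairs are not modeled at all.

```python
# Illustrative subsets only (an assumption for this sketch); real tables are longer.
NO_LINE_START = set("、。，．）」』】ゃゅょっー")  # may not begin a line
NO_LINE_END = set("（「『【")                      # may not end a line


def can_break_before(text: str, i: int) -> bool:
    """Return True if a line break may fall between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        return False  # no break before the first or after the last character
    if text[i] in NO_LINE_START:
        return False  # the next line would start with a prohibited character
    if text[i - 1] in NO_LINE_END:
        return False  # the current line would end with a prohibited character
    return True


def break_points(text: str) -> list[int]:
    """List every index at which a break is permitted under the two rules."""
    return [i for i in range(1, len(text)) if can_break_before(text, i)]
```

A line breaker would then choose, among the permitted indices, the last one that still fits the line width.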
> - how many glyphs are there (well, i could look it up in the big cjk book)
That's rather a tricky question, and the answer depends partly on whether you want a complete solution or an 80/20 one. You probably know that there are two main character sets in Japanese: JIS X 0208 and JIS X 0212 (of course, the full names are suffixed with years, but I forget what the current versions are). The vast majority of all Japanese text (notice I said text, *not* documents) can be written with hiragana and katakana (50+ characters each), the roman alphabet (256, I guess?), and the kanji in JIS X 0208, of which there are about 6000.

However, it's hard to get away without using JIS X 0212. Literary terms and probably some specialized scientific vocabulary often require it, and most critically, geographic and personal names very often use JIS X 0212 characters. It's common to find names whose characters appear to be covered by JIS X 0208, but where a particular name must be written with a different glyph that exists only in JIS X 0212. In Japanese culture it is unacceptable to substitute glyphs in names. An analogy in Western languages might be: suppose you had a typesetting system that was incapable of rendering the string "sen" at the end of a word. Thus, whenever you encountered the names Andersen or Olsen, you would print them as "Anderson" and "Olson." I don't think anyone would consider that acceptable.

So the upshot of this is that, though JIS X 0212 glyphs make up a very small proportion of the Japanese text that is printed (I'd guess 1-2 percent), a large proportion of documents (40-50 percent, maybe) require one or more glyphs from that set. So that's roughly another 6000 glyphs, if you want to do it right.

One other point that may or may not matter is that ... I'm not sure if this is the correct terminology, but the code points of the Japanese character sets are arrayed in a sparse matrix (?). Each plane is 94x94, rather than 256x256. I used to know why.
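On the matrix layout: JIS X 0208 is usually described as a grid of 94 rows by 94 cells, and as far as I can tell the number comes from the 94 printable 7-bit positions 0x21 through 0x7E that each byte may take. The sketch below shows the arithmetic under that assumption. The function names are my own; only the `euc_jp` codec used in the usage lines is part of Python itself.

```python
GRID_SIZE = 94  # rows and cells are both numbered 1..94


def kuten_to_jis(row: int, cell: int) -> bytes:
    """Map a (row, cell) position to its two-byte 7-bit JIS code."""
    if not (1 <= row <= GRID_SIZE and 1 <= cell <= GRID_SIZE):
        raise ValueError("row and cell must each be in 1..94")
    # Adding 0x20 shifts 1..94 onto the printable range 0x21..0x7E.
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc_jp(row: int, cell: int) -> bytes:
    """Map a (row, cell) position to EUC-JP by setting the high bit of each byte."""
    return bytes(b | 0x80 for b in kuten_to_jis(row, cell))
```

For example, `kuten_to_euc_jp(16, 1).decode("euc_jp")` yields the first kanji of the set, and `GRID_SIZE ** 2` gives 8836 positions per plane, of which only part is assigned.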
-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

When a nation follows the Way,
Horses bear manure through its fields;
When a nation ignores the Way,
Horses bear soldiers through its streets.
    --Lao Tzu (Peter Merel, trans.)