On Mon, Jun 09, 2003 at 11:16:27PM +0900, Matthew Huggett wrote:
> Recently, I've made the 'unwise' decision to start studying Japanese next year,
Unwise? Only if you don't really want to do it, or if you are laboring under illusions--left over from the 80s--that it will guarantee you a lucrative and glamorous career in international trade ;-) But anyway, I am also interested in using ConTeXt for Japanese, and would be glad to contribute what I can to this effort.
> I asked about Japanese a while back. Hans requested more information on encodings, fonts, etc. I don't know enough about these things or ConTeXt to know what is needed exactly.
I don't know much about ConTeXt internals, but do know something about "these things," so I may be able to help. Was Hans' request on the mailing list? If you know when it was posted, perhaps I can look it up.
> Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems:
On Mon, Jun 09, 2003 at 06:33:49PM +0200, Tim 't Hart wrote:
> Unicode wasn't that popular because Unix-like operating systems used EUC as encoding, and Microsoft used their own invented Shift-JIS encoding.
There were also cultural/political reasons, with perhaps a touch of Not Invented Here syndrome. But that's a different story.
> So there is still a lot of digital text out there written in these encodings, and a lot of tools still use them. But I think that if you want to write new texts, using Unicode shouldn't be a problem for most users. I guess that most editors supporting Asian encodings also make it possible to save in UTF-8. I think nowadays it's easier to find a Unicode enabled editor than it is to find a Shift-JIS/EUC editor! (Well, on Windows anyway...).
Yes, recent Windows versions (starting with NT 4.0 in the business series, and ... not sure ... ME? in the consumer series) use some form of Unicode as their base encoding, so I think it is now the norm for Windows text editors to support UTF-8 ... I'm pretty sure TextPad does, for example.
> Since ConTeXt already supports UTF-8, I don't see a reason to make things more difficult than they already are by writing text in other encodings.
On the face of it that makes sense. But I don't think it's safe to make a blanket assumption that the text in a ConTeXt document will originate with the creator of the document, or that it will be newly written. Also, UTF-8 support is still a bit half-baked on Unix/Linux systems.
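Legacy text can at least be converted mechanically. Here is a rough sketch in Python (the trial-decode order is a heuristic of mine, not a robust detector -- serious tools use statistical guessing):

```python
def guess_japanese_encoding(data: bytes) -> str:
    """Crude guess by trial decoding. UTF-8 is tried first because
    legacy-encoded Japanese text is almost never valid UTF-8."""
    for enc in ("utf-8", "euc-jp", "shift_jis"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return "unknown"

def convert_to_utf8(raw: bytes) -> bytes:
    """Re-encode legacy bytes as UTF-8 so ConTeXt can read them."""
    return raw.decode(guess_japanese_encoding(raw)).encode("utf-8")

# "Nihongo" in one of the legacy encodings under discussion:
assert convert_to_utf8("日本語".encode("shift_jis")) == "日本語".encode("utf-8")
```

Note that EUC-JP and Shift-JIS can be ambiguous for short inputs, which is why this is only a sketch.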
> When I look at the source of the Chinese module, the most difficult part for me to understand is the part about font encoding, the enco-chi.tex file, and the use of \defineuclass in that file. I guess it has to do something with mapping the written text to the font.
Most likely. I might be able to glean something useful from that file. I'll take a look when I can find the time.
> I guess that if you want to make a proper Japanese module, you'll need to support JIS or Shift-JIS encoded fonts.
This would be a good idea for Type 1 font support. It seems to me that almost all recent Japanese TrueType fonts have a Unicode CMap.
> But on the other hand, maybe we don't need to support that since there are a lot of Japanese Unicode fonts available. I use WinXP, and there we have msmincho.ttc and msgothic.ttc, which are both Unicode fonts.
Can PDFTeX handle TTC files? I know ttf2afm/ttf2pk can process them, but I have tried two or three times to include a Japanese TTC font directly in a PDFTeX document and was never able to make it work.
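For what it's worth, the TTC container itself is simple: a 'ttcf' header followed by byte offsets to ordinary TrueType font directories, so it's easy to at least inspect one. A sketch following the header layout in the TrueType/OpenType specification (error handling omitted):

```python
import struct

def ttc_font_offsets(path):
    """Return the offsets of the individual fonts inside a .ttc file.
    A plain .ttf has no 'ttcf' tag and is reported as a single font."""
    with open(path, "rb") as f:
        tag = f.read(4)
        if tag != b"ttcf":
            return [0]                     # not a collection
        # ttcf header: majorVersion, minorVersion, numFonts,
        # then numFonts big-endian uint32 offsets.
        major, minor, num_fonts = struct.unpack(">HHI", f.read(8))
        return list(struct.unpack(">%dI" % num_fonts, f.read(4 * num_fonts)))
```

Each offset points at a complete font directory, which may be why the converters can handle TTCs (they pick one font out) while embedding the whole collection directly fails.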
> And Cyberbit is a Unicoded font as well. Commercially available fonts by Dynalab (Dynafont Japanese TrueType collection is quite cheap and very good) are also Unicode fonts. Again, I don't think we should make it difficult for ourselves by trying to support non-Unicode fonts while unicoded Japanese fonts are easy to use and widely available.
Well, it can be done in stages. I think that any serious attempt to support Japanese in ConTeXt should encompass all common encodings. But I don't see anything wrong with starting out Unicode-only.
> > Typesetting Japanese could be more complicated than Chinese because of the concurrent use of four writing systems
> The fact that Japanese uses four writing systems is not really a problem.
Maybe it's not a big problem. But it is certainly more complex than Chinese, since there is a mixture of proportional and fixed-width characters, and the presence of Kana and Romaji complicates the line-breaking rules.
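The mixture is easy to see programmatically. A very rough classifier over the main Unicode blocks (ranges from the Unicode charts; punctuation and rarer blocks are ignored for brevity):

```python
def script_of(ch):
    """Classify a character into the four writing systems, roughly --
    only the main Unicode blocks are checked."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"           # CJK Unified Ideographs
    if cp < 0x80 or 0xFF01 <= cp <= 0xFF5E:
        return "romaji"          # ASCII and full-width forms
    return "other"

# A single phrase can mix several systems:
for ch in "TeXで日本語":
    print(ch, script_of(ch))
```

A real module would also need the half-width katakana and extension blocks, but even this shows why the font and spacing handling can't treat the text uniformly.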
> I guess I need to track down a few sample documents. I tried to turn up some info on Japanese typesetting rules but had no luck.
What would a good sample consist of? I can probably find something.
> The only info I got is from Ken Lunde's CJKV book, where he mentions some rules about CJK line breaking.
Yes, Lunde is good, but he doesn't go into enough detail to serve as an implementor's guide. I've also searched for more info on this subject; my impression is that besides Lunde's books there is really nothing available in English. I could probably make some sense out of the Japanese works that are available, but it would take up much more time than I have.
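For reference, the rule Lunde describes boils down to: a line may break between almost any two CJK glyphs, except where a prohibited character would start or end a line (kinsoku shori). A sketch of that check, with a deliberately tiny prohibition table (the real tables run to hundreds of characters):

```python
# Small sample of the kinsoku tables: characters that must not
# begin a line (closing punctuation, small kana, long-vowel mark)
# and characters that must not end one (opening brackets).
NO_LINE_START = set("。、」』）ゃゅょっー")
NO_LINE_END   = set("「『（")

def is_cjk(ch):
    cp = ord(ch)
    return 0x3000 <= cp <= 0x9FFF or 0xFF01 <= cp <= 0xFF5E

def can_break_between(a, b):
    """May a line break fall between adjacent characters a and b?"""
    if not (is_cjk(a) and is_cjk(b)):
        return False      # leave Latin text to TeX's normal rules
    return a not in NO_LINE_END and b not in NO_LINE_START

# In TeX terms: walk the input and insert stretchable glue (or a
# penalty) wherever can_break_between is true.
```

This is presumably the shape of what the interglyph-space question below is really asking for: the glue gives TeX its break points, and the kinsoku check decides where glue may go.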
> With the ConTeXt example that I posted yesterday, I am already able to write Japanese in UTF-8, use a Unicoded Japanese font in ConTeXt, and get Japanese output. I hope the hard part is already behind me! :-) The only thing that still puzzles me is how I can add interglyph space so that TeX can break the lines. If someone can help, I would really appreciate it!
Sorry, no idea. But it sounds like you've made an admirable effort so far. I was working along similar lines a couple of years ago, but was never able to produce anything useful. Guess you're a better TeXnician than I.

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/

    When a nation follows the Way,
    Horses bear manure through its fields;
    When a nation ignores the Way,
    Horses bear soldiers through its streets.
        --Lao Tzu (Peter Merel, trans.)