UTF8 problems with Hangul Syllables
Here is my comment and question on the new ConTeXt feature that supports the UTF-8 encoding.

I tested the following short ConTeXt document containing two Korean characters. In the second line I used the Bitstream Cyberbit font; the corresponding TFM files were generated by ttf2tfm with Unicode.sfd (the same way as for the UTF-8 support in CJK-LaTeX).

    \enableregime [utf]
    \definefontsynonym [UnicodeRegular] [cyberb]
    \chardef\utfunihashmode=1
    \starttext
    ^^eb^^bf^^a1
    ^^ec^^80^^80
    \stoptext

Here, ^^eb^^bf^^a1 = U+BFE1 and ^^ec^^80^^80 = U+C000.

1. Without the third line (\chardef\utfunihashmode=1), I could not see any characters. Why?

2. After enabling \utfunihashmode, I could see the first character but not the second. The difference was that the value of \unidiv was 191 for the first character and 192 for the second. In fact, all characters with 192 <= \unidiv <= 223 (U+C000 to U+DFFF; half of the Hangul Syllables) were not shown correctly. Why?

Anyway, it is now possible to get a PDF file containing several different languages with ConTeXt + dvipdfmx. Furthermore, the text in the PDF file can be searched and extracted. Bookmarks and text annotations work too! I used the following map entry (usually in cid-x.map) for dvipdfmx:

    cyberb@Unicode@ Identity-H :0:cyberbit.ttf

Best,
ChoF.

--
Cho, Jin-Hwan == ChoF
Research Fellow, School of Mathematics
Korea Institute for Advanced Study
chofchof@ktug.or.kr
http://free.kaist.ac.kr/ChoF/
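The byte-to-code-point arithmetic in the message above can be checked independently of TeX. The following is a minimal Python sketch, not part of ConTeXt: the helper names decode_caret_bytes and split_code_point are made up for illustration, and it assumes that \unidiv and \unimod are simply the code point divided by 256 and the remainder, which matches the values 191 and 192 reported in the message.

```python
def decode_caret_bytes(text: str) -> int:
    """Decode one character written as TeX ^^xx byte escapes to its code point."""
    # "^^eb^^bf^^a1" -> ["", "eb", "bf", "a1"]; drop the empty first piece.
    raw = bytes(int(pair, 16) for pair in text.split("^^") if pair)
    decoded = raw.decode("utf-8")
    if len(decoded) != 1:
        raise ValueError(f"expected exactly one character, got {len(decoded)}")
    return ord(decoded)


def split_code_point(code_point: int) -> tuple[int, int]:
    """Split a code point into (high byte, low byte); assumed to mirror \\unidiv and \\unimod."""
    return divmod(code_point, 256)


for escaped in ("^^eb^^bf^^a1", "^^ec^^80^^80"):
    cp = decode_caret_bytes(escaped)
    div, mod = split_code_point(cp)
    print(f"{escaped} -> U+{cp:04X}, div={div}, mod={mod}")
# ^^eb^^bf^^a1 -> U+BFE1, div=191, mod=225
# ^^ec^^80^^80 -> U+C000, div=192, mod=0
```

The same split gives the range quoted in the question: 192 * 256 = 0xC000 and 223 * 256 + 255 = 0xDFFF.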
At 01:49 PM 12/18/2002 +0900, you wrote:
Here is my comment and question on the new feature of ConTeXt supporting the UTF8 encoding.
Actually I tried to test the following short ConTeXt document containing two Korean characters. At the second line I used the Bitstream Cyberbit font and the corresponding TFM files were generated by ttf2tfm with Unicode.sfd (the same way as the UTF8 support in CJK-LaTeX).
\enableregime [utf]
\definefontsynonym [UnicodeRegular] [cyberb]
\chardef\utfunihashmode=1
\starttext
^^eb^^bf^^a1
^^ec^^80^^80
\stoptext
Here, ^^eb^^bf^^a1 = U+BFE1 and ^^ec^^80^^80 = U+C000.
1. Without the third line (\chardef\utfunihashmode=1), I could not see any characters. Why?
2. After enabling \utfunihashmode, I could see the first character but not the second. The difference was that the value of \unidiv was 191 for the first character and 192 for the second. In fact, all characters with 192 <= \unidiv <= 223 (U+C000 to U+DFFF; half of the Hangul Syllables) were not shown correctly. Why?
I'll work this out asap. This is what I use as a test file (unfortunately this font does not show the chars, so I have to download a proper font first). I attached a script that I apply to a ttf file (ttftfmxx.pl htfs.ttf 0 255).

    \chardef\utfunihashmode=1
    \pdfmapfile{+htfsxx.map}
    \definefontsynonym [TestRegular] [htfs]
    \defineunicodefont [SomeFont] [Test]
    \SomeFont
    \enableregime[utf] % todo: autoutf, else problem
    \starttekst
    %^^eb^^bf^^a1
    %^^ec^^80^^80
    \utfunifontglyph{\numexpr("BFE1)}
    \utfunifontglyph{\numexpr("C000)}
    \stoptekst

--
Hans Hagen | PRAGMA ADE | pragma@wxs.nl
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
information: http://www.pragma-ade.com/roadmap.pdf
documentation: http://www.pragma-ade.com/showcase.pdf
Hans Hagen wrote:
I'll work this out asap. This is what I use as a test file (unfortunately this font does not show the chars, so I have to download a proper font first). I attached a script that I apply to a ttf file (ttftfmxx.pl htfs.ttf 0 255).
\chardef\utfunihashmode=1
\pdfmapfile{+htfsxx.map}
\definefontsynonym [TestRegular] [htfs]
\defineunicodefont [SomeFont] [Test]
\SomeFont
\enableregime[utf] % todo: autoutf, else problem
\starttekst
%^^eb^^bf^^a1
%^^ec^^80^^80
\utfunifontglyph{\numexpr("BFE1)}
\utfunifontglyph{\numexpr("C000)}
\stoptekst
Even though I used \utfunifontglyph{\numexpr("C000)} instead of ^^ec^^80^^80, the result was the same: the character was not shown correctly (it came out empty).

I forgot to mention one thing. The Bitstream Cyberbit font does not have a glyph for the character U+C000, so it may be better to test the character U+C0C1 (= ^^ec^^83^^81).

The Bitstream Cyberbit font (Cyberbit.ZIP) can be downloaded from
http://ftp.netscape.com/pub/communicator/extras/fonts/windows/

Here is the log message after turning on \tracingmacros. The difference is that "BFE1 calls \unicodeglyph, but "C0C1 calls \doutfunihash.

1. Log message for \utfunifontglyph{\numexpr("BFE1)}
=================================================

\utfunifontglyph #1->\xdef \unidiv {\number \utfdiv {#1}}\xdef \unimod {\number \utfmod {#1}}\ifnum #1<\utf@i \char \unimod \else \ifcsname \@@univector \unidiv \endcsname \csname \doutfunihash {\unidiv }{#1}\endcsname \else \unicodeglyph \unidiv \unimod \fi \fi
#1<-\numexpr ("BFE1)

\utfdiv #1->\number \numexpr ((#1-\utf@g )/\utf@h )
#1<-\numexpr ("BFE1)

\utfmod #1->\number \numexpr (#1-\utf@h *((#1-\utf@g )/\utf@h ))
#1<-\numexpr ("BFE1)

\@@univector ->univ

\unidiv ->191

\unicodeglyph #1#2->\bgroup \getvalue {@@\currentucharmapping \strippedcsname \uchar }{#1}{#2}\bodyfontsize \unicodescale \bodyfontsize \font \unicodefont =\truefontname {\truefontname \unicodestyle \unicodeone } at \currentfontscale \bodyfontsize \unicodestrut \unicodefont \unicodecharcommand {\char \unicodetwo \relax }\egroup
#1<-\unidiv
#2<-\unimod

... [REMOVED]

2. Log message for \utfunifontglyph{\numexpr("C0C1)}
=================================================

\utfunifontglyph #1->\xdef \unidiv {\number \utfdiv {#1}}\xdef \unimod {\number \utfmod {#1}}\ifnum #1<\utf@i \char \unimod \else \ifcsname \@@univector \unidiv \endcsname \csname \doutfunihash {\unidiv }{#1}\endcsname \else \unicodeglyph \unidiv \unimod \fi \fi
#1<-\numexpr ("C0C1)

\utfdiv #1->\number \numexpr ((#1-\utf@g )/\utf@h )
#1<-\numexpr ("C0C1)

\utfmod #1->\number \numexpr (#1-\utf@h *((#1-\utf@g )/\utf@h ))
#1<-\numexpr ("C0C1)

\@@univector ->univ

\unidiv ->192

\doutfunihash #1#2->\ifcsname \@@univector \number #1\endcsname \csname \@@univector #1\endcsname {\utfmod {#2}}\else \@@unknownchar \fi
#1<-\unidiv
#2<-\numexpr ("C0C1)

\@@univector ->univ

\unidiv ->192

\@@univector ->univ

\unidiv ->192

\univ192 #1->
#1<-\utfmod {\numexpr ("C0C1)}

[NO MESSAGE FURTHER]

Best,
ChoF.

--
ChoF := Jin-Hwan Cho
chofchof@ktug.or.kr
Project Manager of DVIPDFMx and MiKTeX-KTUG
Research Fellow, School of Mathematics
Korea Institute for Advanced Study
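The \utfdiv expansion in the log subtracts \utf@g before dividing by \utf@h. A likely reason is that e-TeX's \numexpr division rounds to the nearest integer instead of truncating. The Python sketch below imitates that rounding for non-negative operands. The values 128 and 256 for \utf@g and \utf@h are assumptions (the log does not show them), and etex_div and utf_div are illustrative names, not ConTeXt macros.

```python
def etex_div(a: int, b: int) -> int:
    """Rounding division for a >= 0 and b > 0: nearest integer, halves rounded up."""
    return (2 * a + b) // (2 * b)


UTF_G = 128  # assumed value of \utf@g (half of the divisor)
UTF_H = 256  # assumed value of \utf@h


def utf_div(code_point: int) -> int:
    """Subtract half the divisor first so that rounding division acts like truncation.

    Only meaningful for code_point >= UTF_G, which keeps the dividend non-negative.
    """
    return etex_div(code_point - UTF_G, UTF_H)


for cp in (0xBFE1, 0xC000, 0xC0C1):
    print(f"U+{cp:04X}: rounded={etex_div(cp, UTF_H)}, "
          f"corrected={utf_div(cp)}, floor={cp // UTF_H}")
# U+BFE1: rounded=192, corrected=191, floor=191
# U+C000: rounded=192, corrected=192, floor=192
# U+C0C1: rounded=193, corrected=192, floor=192
```

Under these assumptions the corrected quotient reproduces the \unidiv values in the two traces (191 for "BFE1 and 192 for "C0C1), whereas plain rounding division would give 192 and 193.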
At 10:53 AM 12/19/2002 +0900, Cho, Jin-Hwan wrote:
Here is the log message after turning on \tracingmacros. The difference is that "BFE1 calls \unicodeglyph, but "C0C1 calls \doutfunihash.
Can you adapt regi-utf.tex to:

    \dostepwiserecurse{192}{223}{1}
      {\expanded{\defineactiveinspector{\recurselevel} % space delimited
         {\noexpand\utftwouniglph{\recurselevel}}}%
      }%\letvalue{\@@univector\recurselevel}\gobbleoneargument}

    \dostepwiserecurse{224}{239}{1}
      {\expanded{\defineactiveinspector{\recurselevel} % space delimited
         {\noexpand\utfthreeuniglph{\recurselevel}}}%
      }%\letvalue{\@@univector\recurselevel}\gobbetwoarguments}

    \dostepwiserecurse{240}{247}{1}
      {\expanded{\defineactiveinspector{\recurselevel} % space delimited
         {\noexpand\utffouruniglph{\recurselevel}}}%
      }%\letvalue{\@@univector\recurselevel}\gobblethreearguments}

i.e. comment out the last lines.

I do get something now, but somehow ttf2pt1 does not like this font, so I get invalid pfb's.

Hans
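The three ranges in these loops (192 to 223, 224 to 239, 240 to 247) are the lead-byte ranges of two-, three-, and four-byte UTF-8 sequences. As a hedged illustration in Python, with a made-up function name and no claim about how regi-utf.tex dispatches internally:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if 0 <= lead_byte <= 127:
        return 1  # plain ASCII
    if 192 <= lead_byte <= 223:
        return 2
    if 224 <= lead_byte <= 239:
        return 3
    if 240 <= lead_byte <= 247:
        return 4
    # 128..191 are continuation bytes; 248..255 never start a sequence.
    raise ValueError(f"{lead_byte} cannot start a UTF-8 sequence")


for lead in (0xEB, 0xEC):
    print(f"lead byte {lead} starts a {utf8_sequence_length(lead)}-byte sequence")
# lead byte 235 starts a 3-byte sequence
# lead byte 236 starts a 3-byte sequence
```

Both test characters in this thread start with a byte in the 224 to 239 range, so they are read as three-byte sequences.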
Hans Hagen wrote:
can you adapt regi-utf.tex to :
\dostepwiserecurse{192}{223}{1}
  {\expanded{\defineactiveinspector{\recurselevel} % space delimited
     {\noexpand\utftwouniglph{\recurselevel}}}%
  }%\letvalue{\@@univector\recurselevel}\gobbleoneargument}
(... skipped ...)
i.e. comment the last lines
Good news. After commenting out the last lines, it worked.

Best,
ChoF.
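One plausible reading of the trace and of the fix, offered as an interpretation and not as a statement from either poster: the commented-out \letvalue lines defined macros named univ192 to univ223 that discard their argument, and the glyph lookup tests for a macro with the same name built from \unidiv. The trace for "C0C1 ends in \univ192 expanding to nothing, which fits that reading. The Python model below is hypothetical throughout: the dictionary stands in for TeX's macro namespace, and every function name is invented for illustration.

```python
def gobble_one_argument(_argument: int) -> str:
    """Stand-in for a macro that swallows its argument and typesets nothing."""
    return ""


def build_macros(define_gobblers: bool) -> dict:
    """Model the macro table with or without the univ192..univ223 gobblers."""
    macros = {}
    if define_gobblers:
        for lead in range(192, 224):
            macros[f"univ{lead}"] = gobble_one_argument
    return macros


def typeset(code_point: int, macros: dict) -> str:
    """Use a univ<div> macro if one exists, else fall back to a direct glyph."""
    div, mod = divmod(code_point, 256)
    handler = macros.get(f"univ{div}")
    if handler is not None:
        return handler(mod)
    return f"glyph({div},{mod})"


for label, flag in (("before fix", True), ("after fix", False)):
    table = build_macros(flag)
    print(label, repr(typeset(0xBFE1, table)), repr(typeset(0xC0C1, table)))
# before fix 'glyph(191,225)' ''
# after fix 'glyph(191,225)' 'glyph(192,193)'
```

In this model only code points whose high byte falls in 192 to 223 are affected, which is the range reported in the original question.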
participants (2)
- Cho, Jin-Hwan
- Hans Hagen