Here's the short version of my question: how do I enable Unicode-encoded characters (just normal accented Latin characters) to be typeset (in any font) in ConTeXt, as \usepackage[utf8]{inputenc} does in LaTeX?

And here is the long one:

I don't really understand how accented characters are typeset in (Con)TeX(t). One of the main reasons someone gave me for switching to LaTeX (maybe 8 years ago) was: "You don't have to worry about accented characters. You can make any accented character and it will work all over the world." (We actually did have lots of problems with MS Word and web browsers at that time.) And it was true. But when I switched to ConTeXt I ran into that problem again.

In LaTeX I used

    \v{c}\v{s}\v{z}

at first, later

    \usepackage{csz} ... "c"s"z

(which works pretty much the same as "a"o"u in German), and finally (when someone told me about that possibility)

    \usepackage[utf8]{inputenc} ... čšž

As I didn't know how to use any other font, I always used CMR, the default, so I didn't have problems with exotic fonts either.

But here we come to ConTeXt. For the German umlauts, \"{a}\"{o}\"{u} (äöü), this was satisfactory:

    \useencoding[windows-1250]
    \mainlanguage[de]

For \v{c}\v{s}\v{z} (čšž) this wasn't the case, so a solution proposed by another ConTeXt user was:

    % output=pdf -translate-file=cp1250cs
    \setupbodyfont [csr,ams,rm]

What I don't really understand: why did the Czech TUG have to design *their own font*, csr (or make changes to cmr), if accented characters already worked perfectly in plain TeX?

The second problem: this works under Windows when typesetting in code page 1250. How can I use accented characters if the text is typeset in Unicode (or latin2) under Linux?

The third problem: how do I typeset '\v{c}' in some other font? I do understand that it may not work in just any font, since someone has to tell the computer how the accented characters are built, but as long as \v{c} works, there's no reason for \useencoding[utf8] followed by Unicode-encoded characters not to produce the desired result.

Thank you,
Mojca
Mojca Miklavec wrote:
But when I switched to ConTeXt I came against that problem again.
In LaTeX I used \v{c}\v{s}\v{z}
this also works in context
at first, later \usepackage{csz} ... "c"s"z
in this case, i assume that csz makes " active and such; if you really want that, we should make an enco-fcz, with definitions like:

    \startlanguagespecifics[cz]
      \appendtoks \makecharacteractive " \to \everynormalcatcodes
      \installcompoundcharacter "c {\v{c}}
      \installcompoundcharacter "s {\v{s}}
      \installcompoundcharacter "z {\v{z}}
    \stoplanguagespecifics

and alike; if you want utf, you should say (at the top of the file)

    \enableregime[utf]
As I didn't know how to use any other font, I always used CMR, the default, so I didn't have problems with exotic fonts either.
this should work with all fonts, since there are fallback definitions
% output=pdf -translate-file=cp1250cs
\setupbodyfont [csr,ams,rm]
try to avoid code pages
What I don't really understand: why did the Czech TUG have to design *their own font*, csr (or make changes to cmr), if accented characters already worked perfectly in plain TeX?
in cmr \v{s} is actually two characters, while in csr it's one (composed) character (built of two characters but seen as one); therefore when you use csr fonts, you can get proper hyphenation (which is not the case in cmr, where the use of the \accent primitive spoils the game); next year, when i can assume that the new latin modern fonts are available everywhere, i will drop cmr cum suis as the default in favor of lsr (which has cmr, plr, csr, vnr, aer etc included)
The second problem: This works under Windows when typesetting in code page 1250. How can I use accented characters if text is typeset in Unicode (or latin2) in Linux?
you probably need to configure your editor to use utf
The third problem: How do I typeset '\v{c}' in some other font? I do understand that it may not function in just any font since someone has to tell the computer how the accented characters are built, but as long as \v{c} works, there's no reason for \useencoding[utf8] and then continuing with unicode encoded characters not to produce the desired result.
don't worry, other fonts work ok; if an encoding does not support the chars you need, a composed char is constructed; [font encodings have nothing to do with input encoding, but they do influence hyphenation] if i'm right, the ec, texnansi, and qx encodings all serve your purpose

Hans

Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
Hans Hagen wrote:
and alike; if you want utf, you should say (at the top of the file)
\enableregime[utf]
Thanks for all the other advice as well, but especially for this one. I had probably already tried this out. Well, almost ;). Neither \enableregime[utf8] nor \enableregime[utf-8] produced the desired output. (I was used to writing '8' after utf, since utf-16 and some others exist as well.)

Thank you,
Mojca
On Mon, 20 Dec 2004 21:52:17 +0100, Hans Hagen
Mojca Miklavec wrote: [...]
The second problem: This works under Windows when typesetting in code page 1250. How can I use accented characters if text is typeset in Unicode (or latin2) in Linux?
you probably need to configure your editor to use utf
Under Linux I use vim/gvim, gedit, gtk2edit for editing Vietnamese text in UTF-8 without any problem :)
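To see why the regime has to match whatever the editor saves, it may help to look at the bytes: the same three letters come out differently in code page 1250, latin2 and UTF-8. A minimal sketch in Python, outside of TeX; the helper name encoded_hex is made up for this illustration:

```python
def encoded_hex(text, codec):
    """Return the bytes of `text` in `codec` as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode(codec))


# The same three letters, saved by an editor in three different encodings.
for codec in ("cp1250", "iso8859_2", "utf_8"):
    print(f"{codec:10} {encoded_hex('čšž', codec)}")
```

In the two one-byte encodings every letter is a single byte (and š and ž differ between them), while in UTF-8 every one of these letters takes two bytes. A file saved in one encoding and read under another regime will therefore not show the intended characters.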
Mojca, In reply to your question:
I don't really understand how accented characters are typeset in (Con)TeX(t). One of the main reasons for switching to LaTeX (maybe 8 years ago) someone mentioned was: "You don't have to worry about accented characters. You can make any accented character and it will work all over the world." (We actually did have lots of problems with MS Word and web browsers at that time.) And it was true.
You know that all characters in a font have a number. If you type a, the font mechanism makes sure that you see an a. In reality the font shows you the character that sits at the numerical position of a. In the dingbats font, for example, the character at that position is not an a but a symbol.

In LaTeX the combination \"{a} can mean two things:

1. in most fonts: show the character at a given numerical position, which means that there is one single character ä.
2. in some other fonts \"{a} means: combine " with a and make an ä. This means that " is combined with the character at the numerical position of a. TeX does this very well and thus constructs very acceptable diacritical signs like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

If you have a font which contains \"{q}, \d{o} or some other special characters, you may instruct TeX not to create the character, but rather to show the contents of a given numerical position in that font. That's what the .enc and .fd files under LaTeX are for.

That's also the reason there are, or used to be, special fonts for Polish and Czech and other languages: they contain predefined characters in one single numerical position, e.g. \v{s} and \v{c}, that TeX does not have to create anew from two signs.

Kind regards,
Robert
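The two cases Robert describes can be modelled with a toy lookup. This is only a sketch in Python, not how TeX is implemented: the tables CMR_LIKE and CSR_LIKE, the function glyph_slots and all slot numbers are invented for illustration and do not reproduce any real font encoding.

```python
# Toy font tables: glyph name -> slot number. The slot numbers are made up
# for illustration and do not reproduce any real font encoding.
CMR_LIKE = {"c": 99, "s": 115, "z": 122, "caron": 20}
CSR_LIKE = {**CMR_LIKE, "ccaron": 200, "scaron": 201, "zcaron": 202}


def glyph_slots(base, accent, font):
    """Return the font slots needed to print `base` carrying `accent`."""
    precomposed = base + accent
    if precomposed in font:
        # Case 1: the font holds the accented letter in one single slot.
        return [font[precomposed]]
    # Case 2: build it from two glyphs, the accent placed over the base.
    return [font[accent], font[base]]


print(glyph_slots("c", "caron", CMR_LIKE))  # two slots: accent plus letter
print(glyph_slots("c", "caron", CSR_LIKE))  # one slot: precomposed glyph
```

In the first font the accented letter needs two slots, in the second only one. That difference is what matters for hyphenation in Hans's reply: a word made only of single-slot letters can be hyphenated properly, while the constructed accent gets in the way.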
r.ermers@hccnet.nl said this at Tue, 21 Dec 2004 08:56:40 +0100:
In LaTeX the combination \"{a} can mean two things: 1. in most fonts: show the character at a given numerical position, which means that there is one single character ä.
2. in some other fonts \"{a} means: combine " with a and make an ä. This means that " is combined with the character at the numerical position of a. TeX does this very well and thus constructs very acceptable diacritical signs like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.
Robert,

That's a helpful explanation. I'll try to expand on that for the ConTeXt case, just in case people are curious or are led into thinking it's just the same:

In ConTeXt, the combination \"{a} means one thing: \adiaeresis (see enco-acc). This \adiaeresis can mean one of two things, depending on the encoding:

1. Numerical position, or
2. The fallback case (defined in enco-def), where a diaeresis/umlaut is placed atop an 'a' glyph. Hyphenation implications as Hans described.

The interesting/helpful thing about ConTeXt is that internally, that glyph is given a consistent name, no matter how it is input or output. So, if you type ä in your given input regime, and that encoding is properly set, that numerical ä (e.g., character #228 in the windows regime) is mapped to \adiaeresis.

Wanna know what happens in UTF-8? Here's my 'simplified' explanation:

In a UTF-8 bytestream, that character "ä" is signified by two bytes: 0xC3, 0xA4. That first byte triggers a conversion of both bytes into two different bytes, the actual Unicode number, 0x00 0xE4 (or: 0, 228). ConTeXt then looks into internal hashes set up (in this case, the unic-000 vector), looks at the 228th element, and sees that it's \adiaeresis. Things then proceed as normal. :)

(It's also interesting to note that for PostScript and TrueType fonts, that number > name > number (glyph) mapping happens yet again in the driver. But all that is outside of TeX proper, so to say any more would be confusing.)

--
Adam T. Lindsay, Computing Dept.     atl@comp.lancs.ac.uk
Lancaster University, InfoLab21        +44(0)1524/510.514
Lancaster, LA1 4WA, UK             Fax:+44(0)1524/510.492
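The arithmetic behind that two-byte conversion can be checked outside TeX. Below is a minimal sketch in Python; the function name decode_two_byte_utf8 and the one-entry dictionary UNIC_000 are hypothetical stand-ins for this illustration, not ConTeXt's actual code or data, and the check does not reject overlong sequences.

```python
def decode_two_byte_utf8(lead, trail):
    """Combine a two-byte UTF-8 sequence into its Unicode code point."""
    # A two-byte sequence has the bit patterns 110xxxxx 10xxxxxx.
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# Hypothetical stand-in for one entry of the unic-000 vector.
UNIC_000 = {228: r"\adiaeresis"}

code_point = decode_two_byte_utf8(0xC3, 0xA4)
print(code_point, hex(code_point), UNIC_000[code_point])
```

For 0xC3 0xA4 the lead byte contributes 3 << 6 = 192 and the trail byte contributes 36, which gives 228 (0xE4), the position Adam mentions.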
participants (5): Adam Lindsay, Hans Hagen, Mojca Miklavec, r.ermers@hccnet.nl, VnPenguin