Hi, now and then I see threads on this list dealing with specific problems of using various languages with UTF-8 input in ConTeXt (processing with pdftex, NOT xetex). I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input? Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language? (If the platform makes a difference, I'd be interested in OSX.) Thank you, Steffen
On 2005-07-14 at 11:30, Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki! Greetings from Hraban! --- http://www.fiee.net/texnique/ http://contextgarden.net
Hi Henning,
Quoting Henning Hraban Ramm:
On 2005-07-14 at 11:30, Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more: But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.) Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings? Shouldn't all UTF-relevant examples be listed under UTF? So, sorry for starting this irrelevant thread, Steffen
On 7/14/05, Steffen Wolfrum wrote:
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more:
But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Sorry, I cannot understand your question. Vietnamese can be typeset with TeX/LaTeX and ConTeXt using different input encodings: TCVN, VISCII, VPS or UTF-8. I'm currently using UTF-8 input for ConTeXt with no problem. I haven't tested other input encodings yet, but they cause no more problems with TeX/LaTeX, so they should be OK with ConTeXt too; am I wrong? -- http://vnoss.org Vietnamese Open Source Software Community
Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more:
But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings?
Shouldn't all UTF-relevant examples be listed under UTF?
\enableregime is not enough. You need to set up a font encoding and
an appropriate bodyfont. For these, see type-enc, type-pre and such.
Example for cyrillic:
\enableregime [utf]
\setupencoding [default=t2a]
\usetypescript [modern-base] [\defaultencoding]
\setupbodyfont [modern]
\starttext
Тест.
\stoptext
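By analogy, a sketch for Vietnamese (the original question also asked about an Asian language), combining the \enableregime[utf] and \setupencoding[default=t5] lines quoted earlier in this thread; whether the modern typescript actually provides a t5 instance on a given installation is an assumption, so treat this as untested:

```tex
% Hedged sketch for Vietnamese, parallel to the Cyrillic example above.
% t5 is the Vietnamese font encoding mentioned earlier in the thread;
% that [modern] covers t5 here is an assumption, not a tested setup.
\enableregime  [utf]
\setupencoding [default=t5]
\usetypescript [modern-base] [\defaultencoding]
\setupbodyfont [modern]
\starttext
Tiếng Việt.
\stoptext
```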
--
Radhelorn
On 2005-07-14 at 21:13, Steffen Wolfrum wrote:
But why is the Vietnamese example with \enableregime[utf] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings? Shouldn't all UTF-relevant examples be listed under UTF?
All examples are (or could be) relevant for UTF-8, because you can set (nearly) everything in Unicode. VISCII is one possible encoding for Vietnamese (and only for Vietnamese), so I found it rather logical to link from there to Vietnamese, even if the Vietnamese example uses UTF-8, which is probably more modern, as probably a lot of other encodings are obsolete/deprecated. So, even if the Vietnamese example could be considered a general UTF-8 example, it shows how one can (and perhaps should) typeset Vietnamese. So I guess the only error or missing link is the link from UTF-8 to Vietnamese (and Cyrillic). Do it yourself as you please.
Henning Hraban Ramm wrote:
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Thank you! It was probably me who copy-pasted some of the material there from some thread, but when I looked at it once again, I learnt something new. A while ago I asked how to typeset things in windows-1250 encoding (\usepackage[cp1250]{inputenc} in LaTeX). I got an answer (just a temporary solution with csr fonts), but it was not a satisfying one.

I'm now attaching a file with support for windows-1250-encoded files. One character is missing (I don't know what to write for non-breaking space) and it's not extensively tested or proofread for typos. So if someone can cast an eye over it, I'll be glad. Does anyone have a script to test the encoding (which would produce a matrix of (almost) 266 characters)?

regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.

\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}

This should be \dstroke and \Dstroke.

Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead. Mojca
On 2005-07-15 at 20:43, Mojca Miklavec wrote:
Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead.
AFAIK the name is PostScript standard; Adobe used some strange names...
(Sorry, I should have opened another thread before.) I have another couple of questions about regime support. How can synonyms for regimes be defined, so that \enableregime[windows-1250] would have the same effect as \enableregime[win-1250] or \enableregime[cp1250]? And \enableregime[utf8] the same effect as \enableregime[utf]? I don't want to be discriminating, but \enableregime[windows] is like writing \enableregime[latin] ("il" in ConTeXt, I think) and expecting the whole world to understand that you mean latin1. In my opinion it should be left there (for backward compatibility if for nothing else), but deprecated and given an unambiguous name like "windows-1252", "windows1252", "win-1252", "win1252", "cp1252" or "windows-western".
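A hedged sketch of what such synonyms might look like with \defineregimesynonym; which regime names actually exist as targets is an assumption here (only [windows] and [utf] are mentioned in this thread):

```tex
% Hedged sketch: possible regime synonyms. The targets 'windows' and
% 'utf' are assumed to be regime names ConTeXt already knows.
\defineregimesynonym [windows-1252] [windows]
\defineregimesynonym [cp1252]       [windows]
\defineregimesynonym [utf8]         [utf]
\defineregimesynonym [utf-8]        [utf]
```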
Does anyone have any script to test the encoding (which would produce a matrix of (almost) 266 characters)?
(Seems like I should have studied for my math exam tomorrow instead of writing this.) I meant: does someone have a nice macro or script to produce the table of 256 (not 266!) characters (minus the non-printable ones), maybe together with the corresponding name (only if it can still be extracted)? It should either look like a usual ASCII table (perhaps with a box around it, like in TeX font tables) or simply list one character per line with a decimal and hex number, more or less in order to be able to test whether the regi-* files are OK. I prepared the file by hand, but now that I know where to look, and after I saw the http://czyborra.com/charsets/iso8859.html page, I think it shouldn't be a problem to prepare support for all those usual and unusual encodings at once (only a clever script and some manually prepared mapping from Unicode to ConTeXt names). Unicode is great, but not everyone uses it (even vim behaves pretty system-dependently and cannot always be used for Unicode out of the box). (I also forgot to thank Patrick for explaining some stuff about regimes to me.) Mojca
Mojca Miklavec wrote:
I have another couple of questions about regime support.
How can synonyms for regimes be defined, so that \enableregime[windows-1250] would have the same effect as \enableregime[win-1250] or \enableregime[cp1250]? And \enableregime[utf8] the same effect as \enableregime[utf].
I don't want to be discriminating, but \enableregime[windows] is like writing \enableregime[latin] ("il" in ConTeXt, I think) and expecting the whole world to understand that you mean latin1. In my opinion it should be left there (for backward compatibility if for nothing else), but deprecated and given an unambiguous name like "windows-1252", "windows1252", "win-1252", "win1252", "cp1252" or "windows-western".
I'll send you a few lines of code to test.

Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead.
The names probably come from PostScript. BTW, there is a difference between umlaut and diaeresis (height). Hans
On 7/17/05, Hans Hagen wrote:
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Yes, there are also exactly two glyphs, \dstroke and \Dstroke, in Vietnamese :) Cheers,
Hans Hagen wrote:
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Thank you. \Dstroke has some "problems" anyway, at least in cmr (lmr?). The stroke should be on the left, but it is on the right. I thought it was just because \tt doesn't have that glyph, but the roman version is also rendered extremely badly.
Where did the "hungarumlaut" characters get the name from?
the names probably come from postscript
Thanks, I looked into some .afm files and they were actually there.
btw, there is a differnece between umlaut and diaeresis (height)
So what is the proper way of writing 'ä' (a with umlaut) then?
can't you make it into a
\defineactivetoken 128 {\texteuro} % € 20AC EURO SIGN
kind of table?
Good idea indeed, it looks much nicer this way.
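A few more lines in that tabular style, for illustration; the Windows-1250 slot numbers below are taken from public code-page charts and the ConTeXt glyph names are assumptions, not a verified vector:

```tex
% Hedged sketch of the tabular regi-* style suggested above;
% slots per Windows-1250, glyph command names assumed.
\defineactivetoken 128 {\texteuro} % € 20AC EURO SIGN
\defineactivetoken 138 {\Scaron}   % Š 0160 LATIN CAPITAL LETTER S WITH CARON
\defineactivetoken 165 {\Aogonek}  % Ą 0104 LATIN CAPITAL LETTER A WITH OGONEK
\defineactivetoken 179 {\lstroke}  % ł 0142 LATIN SMALL LETTER L WITH STROKE
```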
maybe a better name is regi-ce or just regi-1250
regi-ce is a bad name, as there are four central European encodings (IBM-853, ISO-8859-2, MacCE and Windows-1250) plus Croatian. 1250 alone is probably OK, but then there's no hint in the file name about which encoding is meant (windows/ibm/iso/mac ...). I tested the code for regime synonyms and it looks OK. Thanks for investigating my request :)
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by file synonyms? Where would they be used? For Unicode regimes, this is probably a useful (more or less complete) set:

\defineregimesynonym[utf8][utf]
\defineregimesynonym[utf 8][utf]
\defineregimesynonym[utf-8][utf]
\defineregimesynonym[unicode][utf]

(Btw, I tried all four before I got the answer on the mailing list that I should use 'utf' instead.) For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
there are
\showcharacters \showaccents
Thank you. The commands were only kind-of-working here. They produced the table that I wanted (and quite some trash as well), but they were complaining a lot. Thanks for the contribution to Visual Debugging, Hraban! What's the proper name for the non-breaking space, '~', to be put in a regi-* file? Mojca
Mojca Miklavec wrote:
\Dstroke has some "problems" anyway, at least in cmr (lmr?). The stroke should be on the left, but it is on the right. I thought it was just because \tt don't have that glyph, but also the roman version is rendered extremely bad.
In case of doubt, you can discuss this with Boguslaw Jackowski (jacko), who is in charge of Latin Modern; it should be OK in latin roman.
So what is the proper way of writing 'ä' (a umlaut) then?
In german mode, "u will produce it (tricky, since there is no hyphenation then). Latin Modern did have them, and there is a special encoding vector in the ConTeXt distribution (waiting for those umlauts to show up again). Hans
Mojca Miklavec wrote:
maybe a better name is regi-ce or just regi-1250
regi-ce is a bad name, as there are four central European encodings (IBM-853, ISO-8859-2, MacCE and Windows-1250) plus Croatian. 1250 alone is probably OK, but then there's no hint in the file name about which encoding is meant (windows/ibm/iso/mac ...).
I tested the code for regime synonyms and it looks OK. Thanks for investigating my request :)
ok, i'll add it to enco-ini then
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
For Unicode regimes, this is probably a useful (more or less complete) set.
\defineregimesynonym[utf8][utf] \defineregimesynonym[utf 8][utf]
the spacy one does not make much sense
\defineregimesynonym[utf-8][utf] \defineregimesynonym[unicode][utf]
not sure about this one
(Btw, I tried all the four before I got the answer on the mailing list that I should use 'utf' instead.)
For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
indeed, i'll wait patiently for your complete list of synonyms
there are
\showcharacters \showaccents
Thank you. The commands were only kind-of-working here. They produced the table that I wanted (and quite some trash as well), but they were complaining a lot.
Thanks for the contribution to Visual Debugging, Hraban!
What's the proper name for the non-breaking space, '~', to be put in a regi-* file?
How about \nonbreakablespace? Hans
Hans Hagen wrote:
Mojca Miklavec wrote:
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
Ok, if you are provoking me, I'll strike back: none of the definitions above are allowed, because they don't warn the user that he's using the wrong name. They should throw an error instead. The only proper way would be to define something like \setuplabeltext[\s!en][\v!pronouncemyname=moitsa] \setuplabeltext[\s!de][\v!pronouncemyname=mojza] \setuplabeltext[\s!ru][\v!pronouncemyname=мойца] ...
For unicode regimes, this is probably an useful (more or less complete) set.
\defineregimesynonym[utf8][utf] \defineregimesynonym[utf 8][utf]
the spacy one does not make much sense
\defineregimesynonym[utf-8][utf] \defineregimesynonym[unicode][utf]
not sure about this one
Me neither, but "utf" alone is just as doubtful as this one. However, leaving only utf-8 and utf8 is OK.
(Btw, I tried all the four before I got the answer on the mailing list that I should use 'utf' instead.)
For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
indeed, i'll wait patiently for your complete list of synonyms
OK. I'll prepare \defineregimesynonym proposals, but I still don't know what the file synonyms would be used for in this context. The user probably doesn't need to care about file names?
What's the proper name for nonbreaking space, '~', to be put in regi-* file?
how about \nonbreakablespace
Thanks. There was no such glyph in \showcharacters :-)

(PS: I'm sorry for accusing the innocent commands \showcharacters and \showaccents of malfunctioning. I accidentally placed them after an \obeylines command while I was debugging some files. They couldn't have worked there anyway.)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I wanted to post this in another thread, but it probably still fits here: the regi-* files currently map characters from individual encodings directly to TeX commands. But Unicode is already supported in ConTeXt, and the mappings from single-file encodings to Unicode are pretty well defined (perhaps there are some exceptions?) and can be obtained elsewhere on the internet. On the other hand, mapping from Unicode to TeX commands is much less straightforward and sometimes subjective. I noticed some comments in regi-* files like "% \texttrademark changed to \trademark" or "% \dots changed to \textellipsis". Whoever makes changes like that probably makes them only in one file; the rest remain as they are (and probably become deprecated, if not non-functional, one day). On the other hand, there are around ten different cyrillic encodings (mostly already supported by ConTeXt, but anyway) and many other encodings in other languages as well. This means that the same cyrillic letter has to be assigned its name in ten files (regimes), possibly manually.

So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."

Is it possible to switch regimes in the middle of the document (like it is possible to switch languages)? An example usage would be if some input documents (plain text, some older TeX files or database entries) are written in some other encoding than the main stream.
(Possibly switching in such a way that no leftovers remain after the old encoding is replaced by the new one.) Mojca
Mojca Miklavec wrote:
Hans Hagen wrote:
Mojca Miklavec wrote:
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
Ok, if you are provoking me, I'll strike back: none of the definitions above are allowed, because they don't warn the user that he's using the wrong name. They should throw an error instead. The only proper way would be to define something like
\setuplabeltext[\s!en][\v!pronouncemyname=moitsa] \setuplabeltext[\s!de][\v!pronouncemyname=mojza] \setuplabeltext[\s!ru][\v!pronouncemyname=мойца] ...
So how about using \translate[en=moitsa,de=mojza,ru=мойца] then :-)
OK. I'll prepare \defineregimesynonym proposals, but I still don't know what the file synonyms would be used for in this context. The user probably doesn't need to care about file names?
Depends on whether you want to preload all those vectors (takes quite some memory, although I may find a way around that [maybe delayed loading]).
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
You mean ... \defineactivetoken 123 {\uchar{...}{...}} ... it is an option, but it's much slower and takes much more memory: \uchar{2}{33} takes 1 hash pointer and 7 char slots (so probably 8 mem locations), while \eacute takes one mem location.
Is it possible to switch the regimes in the middle of the document (like it is possible to switch the languages)? An example usage would be if some input documents (plain text, some older TeX files or database entries) are written in some other encoding than the main stream. (Possibly switching in such a way that no leftovers remain after the old encoding is replaced by a new one.)
Switching is possible, but in that case you probably want to set toc/index/etc. expansion to yes. Hans
Hans Hagen wrote:
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
you mean ...
\defineactivetoken 123 {\uchar{...}{...}}
it is an option, but it's much slower and takes much more memory
I may be wrong, of course, but I think Mojca proposed something different (and something that should be really easy to implement): have the unicode vectors stored in a format easily parsed by an external ruby script and create the regi-* files from that, using the conversion tables provided by your operating system or iconv or wherever ruby gets them from. Regards, Christopher
Christopher Creutzig wrote:
Hans Hagen wrote:
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
you mean ...
\defineactivetoken 123 {\uchar{...}{...}}
it is an option, but it's much slower and takes much more memory
I may be wrong, of course, but I think Mojca proposed something different (and something that should be really easy to implement): Have the unicode vectors stored in a format easily parsed by an external ruby script and create the regi-* files from that, using the conversion tables provided by your operating system or iconv or wherever ruby gets them from.
Yes, I had something different in mind.

A1.) Prepare the files to be used as a source of transformation from "any" character set to UTF, and prepare a list of synonyms for encodings. (Example: a file that says that in ISO-8859-2, character 0xA3 represents the Unicode character 0x0141 (lstroke); for every character, for every Mac/Windows/ISO/[...] encoding that we want to support.)

A2.) Write a script which automatically generates regi-* files from those files, but the regi-* files would contain only the mapping to the Unicode number. (Example: \startregime[iso-8859-2] ... \somecommandtomapacharactertounicode {163}{1}{65} % lstroke ... \stopregime)

A3.) Prepare a huge file with the mapping from Unicode numbers to ConTeXt commands. (Example: ... \somecommandtomapfromunicodetocontext {1}{65}{\lstroke} ...)

A4.) ... I don't mind what ConTeXt does with this \lstroke afterwards; it seems it is already clever enough to produce the (proper) glyph at the end.

What should ConTeXt do with that?

B1.) The file under A3 should be processed at the beginning. As it may become really huge, exotic definitions should only be preloaded if asked for (\usemodule[korean]), while there is probably no harm if (accented) latin, greek, cyrillic and punctuation (TM, copyright, ...) are preloaded by default.

B2.) Once \enableregime[iso-8859-2] or any other regime is requested, the file with the corresponding regime definitions is processed. However, as \somecommandtomapacharactertounicode {163}{1}{65} is processed, the character '163' is not stored as \uchar{1}{65}, but as \lstroke: '\somecommandtomapacharactertounicode' would first look up which ConTeXt command is saved under \uchar{1}{65} and call \defineactivetoken 163 {\lstroke} as a result.

I don't know the details of the ConTeXt internals, but I think (hope) that it should be possible to do it this way. B1 (preloading the mapping from Unicode to TeX commands) is probably the only "hungry" step in the whole story.
I think it doesn't make any sense to ask the user to "\input regi-whatever". \enableregime and some additional definitions should be clever enough to find out which file to process in order to enable the proper regime.

%%%%%%%%%%%%%%%%%%%%%

Christopher's idea is actually yet another alternative, which combines steps A2 and A3. If the Unicode->ConTeXt mapping is in some easy-to-parse format, there's actually no additional effort if the script writes the ConTeXt commands directly into the regi-* files instead of Unicode numbers, so that B2 has less work to do. As long as it is guaranteed that nobody will change these files manually, this is OK. The only drawback is that if someone notices that "\textellipsis" is more suitable than "\dots", the script has to be changed and the files have to be generated once more. If the character is mapped to (0x2026 HORIZONTAL ELLIPSIS) instead, only one line in the file with the Unicode->ConTeXt mapping (A3) has to be changed. If B2 cannot work as described, Christopher's proposal would be the only proper way to go.

%%%%%%%%%%%%%%%%%%%%%

I wanted to test \showcharacters on live.contextgarden.net (as Hans suggested that my map files are probably not OK), but it didn't compile there. (I hope it's not because of my buggy contributions of the last few days.) Is there any tool or macro to visualize all the glyphs available in a font? \showcharacters (if it works) shows only the glyphs that ConTeXt is aware of. What about the rest? Mojca
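With steps A2 and A3 combined as Christopher suggests, the generated file could contain ready ConTeXt commands directly. A hedged sketch of what such script output might look like, reusing the \startregime wrapper and the \defineactivetoken syntax from this thread; the two ISO-8859-2 slots are taken from public charts, and the glyph command names are assumptions:

```tex
% Hedged sketch: what a script-generated regime file might contain
% once A2 and A3 are merged (ConTeXt commands written out directly).
\startregime[iso-8859-2]
  \defineactivetoken 163 {\Lstroke} % Ł 0141 LATIN CAPITAL LETTER L WITH STROKE
  \defineactivetoken 179 {\lstroke} % ł 0142 LATIN SMALL LETTER L WITH STROKE
\stopregime
```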
Mojca Miklavec wrote:
A1.) prepare the files to be used as a source of transformation from "any" character set to utf and prepare a list of synonyms for encodings
In my point of view, that should only be a fallback. We already have Iconv in ruby and, if we know that ISO-8859-2 is a single-byte coding system, can simply say

conv = Iconv.new("UTF-16", "ISO-8859-2")
256.times { |i| puts lookup[conv.iconv("%c" % i)] }

to get the whole list of 256 characters, assuming we've filled the lookup hash first. As you've said, I'd combine steps A2 and A3, to make ConTeXt run faster. If you want, for whatever reason, to use \textellipsis for an ellipsis (it just looks horribly wrong to me) instead of \dots, you'd need to invoke the ruby script which generates the regi-* files. The whole thing should not require any change at all to ConTeXt itself, since the regi-* files could look exactly as they do now, just being generated automatically. (For the multibyte encodings, the whole thing gets much more tricky.) Christopher
Christopher Creutzig wrote:
conv = Iconv.new("UTF-16", "ISO-8859-2") 256.times { |i| puts lookup[conv.iconv("%c" % i)] }
to get the whole list, assuming we've filled the lookup hash first.
an alternative is to use the tcx files but that is kind of messy so we need a utf-8 hash (can be loaded from unic-* files) Hans ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl -----------------------------------------------------------------
Christopher Creutzig wrote:
We already have Iconv in ruby and can, if we know that ISO-8859-2 is a single byte coding system, simply say
conv = Iconv.new("UTF-16", "ISO-8859-2") 256.times { |i| puts lookup[conv.iconv("%c" % i)] }
to get the whole list, assuming we've filled the lookup hash first.
Great! Sorry for all my philosophising! I don't know ruby (yet) and I didn't even think about this possibility. My last idea was to parse and combine the data on http://www.unicode.org/Public/MAPPINGS/VENDORS/, http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt, but your idea is a hundred times faster and better! Thanks a lot!
As you've said, I'd combine steps A2 and A3, to make ConTeXt run faster.
That's OK for me. If there's a simple internal ruby tool (called every time the Unicode->TeX mapping changes or more encoding support is added) instead of a one-time script, there should be no problem doing that directly.
If you want, for whatever reason, to use \textellipsis for an ellipsis (it just looks horribly wrong to me) instead of \dots, you'd need to invoke the ruby script which generates the regi-* files.
I just wanted to give an example showing that changes are sometimes needed and that it is difficult to trace all the places where they should have been made. Sorry, this example wasn't very illustrative; I don't even know what \textellipsis stands for, I just saw some comments about changes made in regi-* files, or some discrepancies.
The whole thing should not require any change at all to ConTeXt itself, since the regi-* files could look exactly as they do now, just being generated automatically. (For the multibyte encodings, the whole thing gets much more tricky.)
I noticed (perhaps I'm wrong) that the TeX community's support for cyrillic may be better than that in Unicode and in the available old 8-bit encodings. ConTeXt already supports those strange regimes (ctt, dbk, mls, mnk, mos, ncc, ...) that I was unable to find anywhere else. In this case one should also be careful not to spoil this already available feature. I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once? Mojca
On 2005-07-23 at 00:20, Mojca Miklavec wrote:
I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once?
TeX, as an old 8-bit system, isn't able to handle more than 256 chars per font. Only its more modern siblings (like Omega/Aleph) can handle "Unicode size" fonts by themselves. Greetings from Hraban! --- http://www.fiee.net/texnique/ http://contextgarden.net
Mojca Miklavec wrote:
I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once?
Indeed, and since it's related to hyphenation ... but some day pdftex will be 32-bit and OpenType, so ... Hans ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl -----------------------------------------------------------------
Mojca Miklavec wrote:
I'm now attaching a file with support for windows-1250-encoded files. One character is missing (I don't know what to write for the non-breaking space) and it's not extensively tested or checked for typos. So if someone could cast an eye over it, I'd be glad.
maybe a better name is regi-ce or just regi-1250
Does anyone have a script to test the encoding (which would produce a matrix of (almost) 256 characters)?
there are \showcharacters and \showaccents; it all depends on the combination of input regime and font encoding. Hans
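Besides the built-in commands Hans mentions, the matrix Mojca asks about can also be produced with a few lines of ruby that write a ConTeXt file typesetting all 256 slots of the current font via \char. This is a rough sketch; the layout is deliberately primitive and the row labels are just the hex value of each row's first slot.

```ruby
# Sketch: build a ConTeXt test file showing all 256 font slots as a
# 16x16 matrix, one row per output line.

def char_matrix
  lines = ['\starttext']
  (0..15).each do |row|
    cells = (0..15).map { |col| format('\char%d\space', row * 16 + col) }
    lines << format('%02X: %s\par', row * 16, cells.join)
  end
  lines << '\stoptext'
  lines.join("\n")
end

puts char_matrix   # redirect into e.g. charmatrix.tex and run texexec on it
```

Combined with a \enableregime / \setupencoding preamble, this makes it easy to spot empty or wrongly mapped slots.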
On 2005-07-17 at 22:37, Hans Hagen wrote:
there are
\showcharacters \showaccents
BTW, I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (I had no time to try them all). Greetings from Hraban!
Henning Hraban Ramm wrote:
On 2005-07-17 at 22:37, Hans Hagen wrote:
there are
\showcharacters \showaccents
BTW, I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (I had no time to try them all).
thanks (\trace... is also handy) Hans
On 2005-07-18 at 00:36, Hans Hagen wrote:
\showcharacters \showaccents BTW I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (had no time to try them all). (\trace... is also handy)
Hm, there's no trace in texshow, but a lot of trace...true in the sources; hopefully I caught them all on http://contextgarden.net/Visual_Debugging. I found some \tracing in the jEdit xml, but nowhere in the sources, and some single \traced... (with d). Greetings from Hraban!
At 09:18 AM 7/18/2005, you wrote:
On 2005-07-18 at 00:36, Hans Hagen wrote:
\showcharacters \showaccents BTW I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (had no time to try them all). (\trace... is also handy)
Hm, there's no trace in texshow, but a lot of trace...true in the sources; hopefully I caught them all on http://contextgarden.net/Visual_Debugging. I found some \tracing in the jEdit xml, but nowhere in the sources, and some single \traced... (with d).
I had also put some lists of all the \show... and \trace... commands I found in the sources on the Discussion page for Visual_Debugging; it may be useful to go through and compare our lists. Note that the \trace... commands seem to be defined with \newif; thus, they come in "\iftrace...", "\trace...true", and "\trace...false" trios. - Brooks
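For readers less familiar with plain TeX internals, the trio Brooks describes comes from a single \newif declaration. A generic illustration (the switch name here is made up, not one of ConTeXt's actual tracers):

```tex
\newif\iftracefoo   % creates \iftracefoo, \tracefootrue and \tracefoofalse
\tracefootrue       % switch it on
\iftracefoo
  \message{tracefoo is enabled}%
\fi
\tracefoofalse      % and off again
```

This is why grepping the sources for "\trace" turns up three related control sequences for every switch.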
Brooks Moses wrote:
I had also put some lists of all the \show... and \trace... commands I found in the sources on the Discussion page for Visual_Debugging; it may be useful to go through and compare our lists.
Note that the \trace... commands seem to be defined with \newif; thus, they come in "\iftrace...", "\trace...true", and "\trace...false" trios.
once we have the lists we can see if some more consistency is needed Hans
participants (8)
-
Brooks Moses
-
Christopher Creutzig
-
Hans Hagen
-
Henning Hraban Ramm
-
Mojca Miklavec
-
Radhelorn
-
Steffen Wolfrum
-
VnPenguin