Hi, now and then I see threads on this list dealing with specific problems of using various languages with UTF-8 input in ConTeXt (processing with pdftex, NOT xetex). I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input? Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language? (If the platform makes a difference, I'd be interested in OSX.) Thank you, Steffen
On 2005-07-14 at 11:30, Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki! Greetings from Hraban! --- http://www.fiee.net/texnique/ http://contextgarden.net
Hi Henning,
Quoting Henning Hraban Ramm:
On 2005-07-14 at 11:30, Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more: But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.) Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings? Shouldn't all UTF-relevant examples be listed under UTF? So, sorry for starting this irrelevant thread, Steffen
On 7/14/05, Steffen Wolfrum wrote:
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more:
But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Sorry, I cannot understand your question. Vietnamese can be typeset with TeX/LaTeX and ConTeXt using different input encodings: TCVN, VISCII, VPS or UTF-8. I'm currently using UTF-8 input for ConTeXt with no problem. I haven't tested other input encodings yet, but they cause no more problems with TeX/LaTeX, so they should be OK with ConTeXt too; am I wrong? -- http://vnoss.org Vietnamese Open Source Software Community
Steffen Wolfrum wrote:
I know there is \enableregime[utf], but what else do I need so that the output matches my UTF-8 input?
Could someone maybe give a short and usable how-to on common examples: Greek, Russian, an East European language and an Asian language?
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Well, yes, I wasn't interested in e.g. VISCII, but I read the info for UTF. But as you wrote "linked pages" I became more curious and looked up those pages as well. Indeed, there is more:
But why is the Vietnamese example with \enableregime[utf] \setupencoding[default=t5] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings?
Shouldn't all UTF-relevant examples be listed under UTF?
\enableregime is not enough. You need to set up a font encoding and
an appropriate bodyfont. For these, see type-enc, type-pre and such.
Example for cyrillic:
\enableregime [utf]
\setupencoding [default=t2a]
\usetypescript [modern-base] [\defaultencoding]
\setupbodyfont [modern]
\starttext
Тест.
\stoptext
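By analogy, a sketch for Vietnamese (the original question also asked about an Asian language), combining the \enableregime[utf] and \setupencoding[default=t5] lines quoted earlier in this thread; whether the modern typescript actually provides a t5 instance on a given installation is an assumption, so treat this as untested:

```tex
% Hedged sketch for Vietnamese, parallel to the Cyrillic example above.
% t5 is the Vietnamese font encoding mentioned earlier in the thread;
% that [modern] covers t5 here is an assumption, not a tested setup.
\enableregime  [utf]
\setupencoding [default=t5]
\usetypescript [modern-base] [\defaultencoding]
\setupbodyfont [modern]
\starttext
Tiếng Việt.
\stoptext
```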
--
Radhelorn
On 2005-07-14 at 21:13, Steffen Wolfrum wrote:
But why is the Vietnamese example with \enableregime[utf] linked under "vis = viscii VISCII Vietnamese" and not accessible under "utf UTF-8 Unicode"? (Same for cyrillic.)
Is this just a wrong link, or does this show that I haven't understood the relationship between regimes and encodings? Shouldn't all UTF-relevant examples be listed under UTF?
All examples are (or could be) relevant for UTF-8, because you can set (nearly) everything in Unicode. VISCII is one possible encoding for Vietnamese (and only for Vietnamese), so I found it rather logical to link from there to Vietnamese, even if the Vietnamese example uses UTF-8, which is probably more modern, as probably a lot of other encodings are obsolete/deprecated. So, even if the Vietnamese example could be considered a general UTF-8 example, it shows how one can (and perhaps should) typeset Vietnamese. So I guess the only error or missing link is the link from UTF-8 to Vietnamese (and Cyrillic). Do it yourself as you please.
Henning Hraban Ramm wrote:
You did read http://contextgarden.net/Encodings_and_Regimes and the linked pages, didn't you? If you learn anything new, please add it to the wiki!
Thank you! It was probably me who copy-pasted some of the material there from some thread, but when I looked at it once again, I learnt something new. A while ago I asked how to typeset things in windows-1250 encoding (\usepackage[cp1250]{inputenc} in LaTeX). I got an answer (just a temporary solution with csr fonts), but it was not a satisfying one.

I'm now attaching a file with support for windows-1250-encoded files. One character is missing (I don't know what to write for non-breaking space) and it's not extensively tested or proofread for typos. So if someone can cast an eye over it, I'll be glad. Does anyone have a script to test the encoding (which would produce a matrix of (almost) 266 characters)?

regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.

\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}

This should be \dstroke and \Dstroke.

Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead. Mojca
On 2005-07-15 at 20:43, Mojca Miklavec wrote:
Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead.
AFAIK the name is PostScript standard; Adobe used some strange names...
(Sorry, I should have opened another thread before.) I have another couple of questions about regime support. How can synonyms for regimes be defined, so that \enableregime[windows-1250] would have the same effect as \enableregime[win-1250] or \enableregime[cp1250]? And \enableregime[utf8] the same effect as \enableregime[utf]? I don't want to be discriminating, but \enableregime[windows] is like writing \enableregime[latin] ("il" in ConTeXt, I think) and expecting the whole world to understand that you mean latin1. In my opinion it should be left there (for backward compatibility if for nothing else), but deprecated and given an unambiguous name like "windows-1252", "windows1252", "win-1252", "win1252", "cp1252" or "windows-western".
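A hedged sketch of what such synonyms might look like with \defineregimesynonym; which regime names actually exist as targets is an assumption here (only [windows] and [utf] are mentioned in this thread):

```tex
% Hedged sketch: possible regime synonyms. The targets 'windows' and
% 'utf' are assumed to be regime names ConTeXt already knows.
\defineregimesynonym [windows-1252] [windows]
\defineregimesynonym [cp1252]       [windows]
\defineregimesynonym [utf8]         [utf]
\defineregimesynonym [utf-8]        [utf]
```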
Does anyone have any script to test the encoding (which would produce a matrix of (almost) 266 characters)?
(Seems like I should have studied for my math exam tomorrow instead of writing this.) I meant: does someone have a nice macro or script to produce the table of 256 (not 266!) characters (minus the non-printable ones), maybe together with the corresponding name (only if it can still be extracted)? It should either look like a usual ASCII table (perhaps with a box around it, like in TeX font tables) or simply list one character per line with a decimal and hex number, more or less in order to be able to test whether the regi-* files are OK. I prepared the file by hand, but now that I know where to look, and after I saw the http://czyborra.com/charsets/iso8859.html page, I think it shouldn't be a problem to prepare support for all those usual and unusual encodings at once (only a clever script and some manually prepared mapping from Unicode to ConTeXt names). Unicode is great, but not everyone uses it (even vim behaves pretty system-dependently and cannot always be used for Unicode out of the box). (I also forgot to thank Patrick for explaining some stuff about regimes to me.) Mojca
Mojca Miklavec wrote:
I have another couple of questions about regime support.
How can synonyms for regimes be defined, so that \enableregime[windows-1250] would have the same effect as \enableregime[win-1250] or \enableregime[cp1250]? And \enableregime[utf8] the same effect as \enableregime[utf].
I don't want to be discriminating, but \enableregime[windows] is like writing \enableregime[latin] ("il" in ConTeXt, I think) and expecting the whole world to understand that you mean latin1. In my opinion it should be left there (for backward compatibility if for nothing else), but deprecated and given an unambiguous name like "windows-1252", "windows1252", "win-1252", "win1252", "cp1252" or "windows-western".
I'll send you a few lines of code to test.

Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Where did the "hungarumlaut" characters get their name from? Wouldn't it be better to have "doubleacute" (as in the Unicode standard)? We also don't name the characters "germanumlaut" but "diaeresis" instead.
The names probably come from PostScript. BTW, there is a difference between umlaut and diaeresis (height). Hans
On 7/17/05, Hans Hagen wrote:
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Yes, there are also exactly two glyphs, \dstroke and \Dstroke, in Vietnamese :) Cheers,
Hans Hagen wrote:
Mojca Miklavec wrote:
regi-lat.tex is interesting, made just for typesetting Croatian :) Perhaps I can add some stuff there too.
\defineactivetoken đ {\pseudoencodeddj} \defineactivetoken Ð {\pseudoencodedDJ}
This should be \dstroke and \Dstroke.
ok, changed
Thank you. \Dstroke has some "problems" anyway, at least in cmr (lmr?). The stroke should be on the left, but it is on the right. I thought it was just because \tt doesn't have that glyph, but the roman version is also rendered extremely badly.
Where did the "hungarumlaut" characters get the name from?
the names probably come from postscript
Thanks, I looked into some .afm files and they were actually there.
btw, there is a differnece between umlaut and diaeresis (height)
So what is the proper way of writing 'ä' (a with umlaut) then?
can't you make it into a
\defineactivetoken 128 {\texteuro} % € 20AC EURO SIGN
kind of table?
Good idea indeed, it looks much nicer this way.
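A few more lines in that tabular style, for illustration; the Windows-1250 slot numbers below are taken from public code-page charts and the ConTeXt glyph names are assumptions, not a verified vector:

```tex
% Hedged sketch of the tabular regi-* style suggested above;
% slots per Windows-1250, glyph command names assumed.
\defineactivetoken 128 {\texteuro} % € 20AC EURO SIGN
\defineactivetoken 138 {\Scaron}   % Š 0160 LATIN CAPITAL LETTER S WITH CARON
\defineactivetoken 165 {\Aogonek}  % Ą 0104 LATIN CAPITAL LETTER A WITH OGONEK
\defineactivetoken 179 {\lstroke}  % ł 0142 LATIN SMALL LETTER L WITH STROKE
```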
maybe a better name is regi-ce or just regi-1250
regi-ce is a bad name, as there are four central European encodings (IBM-853, ISO-8859-2, MacCE and Windows-1250) plus Croatian. 1250 alone is probably OK, but then there's no hint in the file name about which encoding is meant (windows/ibm/iso/mac ...). I tested the code for regime synonyms and it looks OK. Thanks for investigating my request :)
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by file synonyms? Where would they be used? For Unicode regimes, this is probably a useful (more or less complete) set:

\defineregimesynonym[utf8][utf]
\defineregimesynonym[utf 8][utf]
\defineregimesynonym[utf-8][utf]
\defineregimesynonym[unicode][utf]

(Btw, I tried all four before I got the answer on the mailing list that I should use 'utf' instead.) For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
there are
\showcharacters \showaccents
Thank you. The commands were only kind-of-working here. They produced the table that I wanted (and quite some trash as well), but they were complaining a lot. Thanks for the contribution to Visual Debugging, Hraban! What's the proper name for the non-breaking space, '~', to be put in a regi-* file? Mojca
Mojca Miklavec wrote:
\Dstroke has some "problems" anyway, at least in cmr (lmr?). The stroke should be on the left, but it is on the right. I thought it was just because \tt don't have that glyph, but also the roman version is rendered extremely bad.
In case of doubt, you can discuss this with Boguslaw Jackowski (jacko), who is in charge of Latin Modern; it should be OK in latin roman.
So what is the proper way of writing 'ä' (a umlaut) then?
In german mode, "u will produce it (tricky, since there is no hyphenation then). Latin Modern did have them, and there is a special encoding vector in the ConTeXt distribution (waiting for those umlauts to show up again). Hans
Mojca Miklavec wrote:
maybe a better name is regi-ce or just regi-1250
regi-ce is a bad name, as there are four central European encodings (IBM-853, ISO-8859-2, MacCE and Windows-1250) plus Croatian. 1250 alone is probably OK, but then there's no hint in the file name about which encoding is meant (windows/ibm/iso/mac ...).
I tested the code for regime synonyms and it looks OK. Thanks for investigating my request :)
ok, i'll add it to enco-ini then
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
For Unicode regimes, this is probably a useful (more or less complete) set.
\defineregimesynonym[utf8][utf] \defineregimesynonym[utf 8][utf]
the spacy one does not make much sense
\defineregimesynonym[utf-8][utf] \defineregimesynonym[unicode][utf]
not sure about this one
(Btw, I tried all the four before I got the answer on the mailing list that I should use 'utf' instead.)
For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
indeed, i'll wait patiently for your complete list of synonyms
there are
\showcharacters \showaccents
Thank you. The commands were only kind-of-working here. They produced the table that I wanted (and quite some trash as well), but they were complaining a lot.
Thanks for the contribution to Visual Debugging, Hraban!
What's the proper name for the non-breaking space, '~', to be put in a regi-* file?
How about \nonbreakablespace? Hans
Hans Hagen wrote:
Mojca Miklavec wrote:
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
Ok, if you are provoking me, I'll strike back: none of the definitions above are allowed, because they don't warn the user that he's using the wrong name. They should throw an error instead. The only proper way would be to define something like \setuplabeltext[\s!en][\v!pronouncemyname=moitsa] \setuplabeltext[\s!de][\v!pronouncemyname=mojza] \setuplabeltext[\s!ru][\v!pronouncemyname=мойца] ...
For unicode regimes, this is probably an useful (more or less complete) set.
\defineregimesynonym[utf8][utf] \defineregimesynonym[utf 8][utf]
the spacy one does not make much sense
\defineregimesynonym[utf-8][utf] \defineregimesynonym[unicode][utf]
not sure about this one
Me neither, but "utf" alone is just as doubtful as this one. However, leaving only utf-8 and utf8 is OK.
(Btw, I tried all the four before I got the answer on the mailing list that I should use 'utf' instead.)
For the rest of the regimes I have to take a look first, so that I don't say anything wrong. There has to be only one clear scheme.
indeed, i'll wait patiently for your complete list of synonyms
OK. I'll prepare \defineregimesynonym proposals, but I still don't know what the file synonyms would be used for in this context. The user probably doesn't need to care about file names?
What's the proper name for nonbreaking space, '~', to be put in regi-* file?
how about \nonbreakablespace
Thanks. There was no such glyph in \showcharacters :-)

(PS: I'm sorry for accusing the innocent commands \showcharacters and \showaccents of malfunctioning. I accidentally placed them after an \obeylines command while I was debugging some files. They couldn't have worked there anyway.)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I wanted to post this in another thread, but it probably still fits here: the regi-* files currently map characters from individual encodings directly to TeX commands. But Unicode is already supported in ConTeXt, and the mappings from single-file encodings to Unicode are pretty well defined (perhaps there are some exceptions?) and can be obtained elsewhere on the internet. On the other hand, mapping from Unicode to TeX commands is much less straightforward and sometimes subjective. I noticed some comments in regi-* files like "% \texttrademark changed to \trademark" or "% \dots changed to \textellipsis". Whoever makes changes like that probably makes them only in one file; the rest remain as they are (and probably become deprecated, if not non-functional, one day). On the other hand, there are around ten different cyrillic encodings (mostly already supported by ConTeXt, but anyway) and many other encodings in other languages as well. This means that the same cyrillic letter has to be assigned its name in ten files (regimes), possibly manually.

So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."

Is it possible to switch regimes in the middle of the document (like it is possible to switch languages)? An example usage would be if some input documents (plain text, some older TeX files or database entries) are written in some other encoding than the main stream.
(Possibly switching in such a way that no leftovers remain after the old encoding is replaced by the new one.) Mojca
Mojca Miklavec wrote:
Hans Hagen wrote:
Mojca Miklavec wrote:
(concerning eregi-* files: you can define filesynonyms so we need a list of filesynonyms and regimesynonyms)
What do you mean by writing file synonyms? Where would it be used?
\definefilesynonym [mojka] [mojca] \definefilesynonym [moika] [mojca] \definefilesynonym [moica] [mojca]
Ok, if you are provoking me, I'll strike back: none of the definitions above are allowed, because they don't warn the user that he's using the wrong name. They should throw an error instead. The only proper way would be to define something like
\setuplabeltext[\s!en][\v!pronouncemyname=moitsa] \setuplabeltext[\s!de][\v!pronouncemyname=mojza] \setuplabeltext[\s!ru][\v!pronouncemyname=мойца] ...
So how about using \translate[en=moitsa,de=mojza,ru=мойца] then :-)
OK. I'll prepare \defineregimesynonym proposals, but I still don't know what the file synonyms would be used for in this context. The user probably doesn't need to care about file names?
Depends on whether you want to preload all those vectors (takes quite some memory, although I may find a way around that [maybe delayed loading]).
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
You mean ... \defineactivetoken 123 {\uchar{...}{...}} ... it is an option, but it's much slower and takes much more memory: \uchar{2}{33} takes 1 hash pointer and 7 char slots (so probably 8 mem locations), while \eacute takes one mem location.
Is it possible to switch the regimes in the middle of the document (like it is possible to switch the languages)? An example usage would be if some input documents (plain text, some older TeX files or database entries) are written in some other encoding than the main stream. (Possibly switching in such a way that no leftovers remain after the old encoding is replaced by a new one.)
Switching is possible, but in that case you probably want to set toc/index/etc. expansion to yes. Hans
Hans Hagen wrote:
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
you mean ...
\defineactivetoken 123 {\uchar{...}{...}}
it is an option, but it's much slower and takes much more memory
I may be wrong, of course, but I think Mojca proposed something different (and something that should be really easy to implement): have the unicode vectors stored in a format easily parsed by an external ruby script and create the regi-* files from that, using the conversion tables provided by your operating system or iconv or wherever ruby gets them from. Regards, Christopher
Christopher Creutzig wrote:
Hans Hagen wrote:
So why not map the characters to Unicode first and define the mapping from Unicode to TeX commands only once? regi-* files (at least in the meaning they have now) could be prepared automatically by a script, less error-prone and without the need to say "Some more definitions will be added later."
you mean ...
\defineactivetoken 123 {\uchar{...}{...}}
it is an option, but it's much slower and takes much more memory
I may be wrong, of course, but I think Mojca proposed something different (and something that should be really easy to implement): Have the unicode vectors stored in a format easily parsed by an external ruby script and create the regi-* files from that, using the conversion tables provided by your operating system or iconv or wherever ruby gets them from.
Yes, I had something different in mind.

A1.) Prepare the files to be used as a source of transformation from "any" character set to UTF, and prepare a list of synonyms for encodings. (Example: a file that says that in ISO-8859-2, character 0xA3 represents the Unicode character 0x0141 (lstroke); for every character, for every Mac/Windows/ISO/[...] encoding that we want to support.)

A2.) Write a script which automatically generates regi-* files from those files, but the regi-* files would contain only the mapping to the Unicode number. (Example: \startregime[iso-8859-2] ... \somecommandtomapacharactertounicode {163}{1}{65} % lstroke ... \stopregime)

A3.) Prepare a huge file with the mapping from Unicode numbers to ConTeXt commands. (Example: ... \somecommandtomapfromunicodetocontext {1}{65}{\lstroke} ...)

A4.) ... I don't mind what ConTeXt does with this \lstroke afterwards; it seems it is already clever enough to produce the (proper) glyph at the end.

What should ConTeXt do with that?

B1.) The file under A3 should be processed at the beginning. As it may become really huge, exotic definitions should only be preloaded if asked for (\usemodule[korean]), while there is probably no harm if (accented) latin, greek, cyrillic and punctuation (TM, copyright, ...) are preloaded by default.

B2.) Once \enableregime[iso-8859-2] or any other regime is requested, the file with the corresponding regime definitions is processed. However, as \somecommandtomapacharactertounicode {163}{1}{65} is processed, the character '163' is not stored as \uchar{1}{65}, but as \lstroke: '\somecommandtomapacharactertounicode' would first look up which ConTeXt command is saved under \uchar{1}{65} and call \defineactivetoken 163 {\lstroke} as a result.

I don't know the details of the ConTeXt internals, but I think (hope) that it should be possible to do it this way. B1 (preloading the mapping from Unicode to TeX commands) is probably the only "hungry" step in the whole story.
I think it doesn't make any sense to ask the user to "\input regi-whatever". \enableregime and some additional definitions should be clever enough to find out which file to process in order to enable the proper regime.

%%%%%%%%%%%%%%%%%%%%%

Christopher's idea is actually yet another alternative, which combines steps A2 and A3. If the Unicode->ConTeXt mapping is in some easy-to-parse format, there's actually no additional effort if the script writes the ConTeXt commands directly into the regi-* files instead of Unicode numbers, so that B2 has less work to do. As long as it is guaranteed that nobody will change these files manually, this is OK. The only drawback is that if someone notices that "\textellipsis" is more suitable than "\dots", the script has to be changed and the files have to be generated once more. If the character is mapped to (0x2026 HORIZONTAL ELLIPSIS) instead, only one line in the file with the Unicode->ConTeXt mapping (A3) has to be changed. If B2 cannot work as described, Christopher's proposal would be the only proper way to go.

%%%%%%%%%%%%%%%%%%%%%

I wanted to test \showcharacters on live.contextgarden.net (as Hans suggested that my map files are probably not OK), but it didn't compile there. (I hope it's not because of my buggy contributions of the last few days.) Is there any tool or macro to visualize all the glyphs available in a font? \showcharacters (if it works) shows only the glyphs that ConTeXt is aware of. What about the rest? Mojca
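With steps A2 and A3 combined as Christopher suggests, the generated file could contain ready ConTeXt commands directly. A hedged sketch of what such script output might look like, reusing the \startregime wrapper and the \defineactivetoken syntax from this thread; the two ISO-8859-2 slots are taken from public charts, and the glyph command names are assumptions:

```tex
% Hedged sketch: what a script-generated regime file might contain
% once A2 and A3 are merged (ConTeXt commands written out directly).
\startregime[iso-8859-2]
  \defineactivetoken 163 {\Lstroke} % Ł 0141 LATIN CAPITAL LETTER L WITH STROKE
  \defineactivetoken 179 {\lstroke} % ł 0142 LATIN SMALL LETTER L WITH STROKE
\stopregime
```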
Mojca Miklavec wrote:
A1.) prepare the files to be used as a source of transformation from "any" character set to utf and prepare a list of synonyms for encodings
In my point of view, that should only be a fallback. We already have Iconv in ruby and, if we know that ISO-8859-2 is a single-byte coding system, can simply say

conv = Iconv.new("UTF-16", "ISO-8859-2")
256.times { |i| puts lookup[conv.iconv("%c" % i)] }

to get the whole list of 256 characters, assuming we've filled the lookup hash first. As you've said, I'd combine steps A2 and A3, to make ConTeXt run faster. If you want, for whatever reason, to use \textellipsis for an ellipsis (it just looks horribly wrong to me) instead of \dots, you'd need to invoke the ruby script which generates the regi-* files. The whole thing should not require any change at all to ConTeXt itself, since the regi-* files could look exactly as they do now, just being generated automatically. (For the multibyte encodings, the whole thing gets much more tricky.) Christopher
Christopher Creutzig wrote:
conv = Iconv.new("UTF-16", "ISO-8859-2") 256.times { |i| puts lookup[conv.iconv("%c" % i)] }
to get the whole list, assuming we've filled the lookup hash first.
an alternative is to use the tcx files but that is kind of messy so we need a utf-8 hash (can be loaded from unic-* files) Hans ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl -----------------------------------------------------------------
Christopher Creutzig wrote:
We already have Iconv in ruby and can, if we know that ISO-8859-2 is a single byte coding system, simply say
conv = Iconv.new("UTF-16", "ISO-8859-2") 256.times { |i| puts lookup[conv.iconv("%c" % i)] }
to get the whole list, assuming we've filled the lookup hash first.
Great! Sorry for all my philosophising! I don't know ruby (yet) and I didn't even think about this possibility. My last idea was to parse and combine the data on http://www.unicode.org/Public/MAPPINGS/VENDORS/, http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt, but your idea is a hundred times faster and better! Thanks a lot!
As you've said, I'd combine steps A2 and A3, to make ConTeXt run faster.
That's OK for me. If there's a simple internal ruby tool (called every time the Unicode->TeX mapping changes or more encoding support is added) instead of a one-time script, there should be no problem doing that directly.
If you want, for whatever reason, to use \textellipsis for an ellipsis (it just looks horribly wrong to me) instead of \dots, you'd need to invoke the ruby script which generates the regi-* files.
I just wanted to give an example showing that changes are sometimes needed and that it is difficult to trace all the places where they should have been made. Sorry, this example wasn't very illustrative; I don't even know what \textellipsis stands for, I just saw some comments about changes made in regi-* files, or some discrepancies.
The whole thing should not require any change at all to ConTeXt itself, since the regi-* files could look exactly as they do now, just being generated automatically. (For the multibyte encodings, the whole thing gets much more tricky.)
I noticed (perhaps I'm wrong) that the TeX community's support for cyrillic may be better than that in Unicode and in the available old 8-bit encodings. ConTeXt already supports those strange regimes (ctt, dbk, mls, mnk, mos, ncc, ...) that I was unable to find anywhere else. In this case one should also be careful not to spoil this already available feature. I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once? Mojca
On 2005-07-23 at 00:20, Mojca Miklavec wrote:
I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once?
TeX, as an old 8-bit system, isn't able to handle more than 256 chars per font. Only its more modern siblings (like Omega/Aleph) can handle "Unicode size" fonts by themselves. Greetings from Hraban! --- http://www.fiee.net/texnique/ http://contextgarden.net
Mojca Miklavec wrote:
I'm still slightly confused by the encoding files (texnansi, ec, ..., in one case iso-8859-7 is used). Does it mean that it is impossible (or at least very complex or slow) to access more than 256 characters from a single font at once?
Indeed, and since it's related to hyphenation ... but some day pdftex will be 32-bit and OpenType, so ... Hans ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl -----------------------------------------------------------------
Mojca Miklavec wrote:
I'm now attaching a file with support for windows-1250-encoded files. One character is missing (I don't know what to write for the non-breaking space) and it's not extensively tested or checked for typos. So if someone could cast an eye over it, I'd be glad.
maybe a better name is regi-ce or just regi-1250
Does anyone have a script to test the encoding (which would produce a matrix of (almost) 256 characters)?
there are \showcharacters and \showaccents; it all depends on the combination of input regime and font encoding. Hans
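Besides the built-in commands Hans mentions, the matrix Mojca asks about can also be produced with a few lines of ruby that write a ConTeXt file typesetting all 256 slots of the current font via \char. This is a rough sketch; the layout is deliberately primitive and the row labels are just the hex value of each row's first slot.

```ruby
# Sketch: build a ConTeXt test file showing all 256 font slots as a
# 16x16 matrix, one row per output line.

def char_matrix
  lines = ['\starttext']
  (0..15).each do |row|
    cells = (0..15).map { |col| format('\char%d\space', row * 16 + col) }
    lines << format('%02X: %s\par', row * 16, cells.join)
  end
  lines << '\stoptext'
  lines.join("\n")
end

puts char_matrix   # redirect into e.g. charmatrix.tex and run texexec on it
```

Combined with a \enableregime / \setupencoding preamble, this makes it easy to spot empty or wrongly mapped slots.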
On 2005-07-17 at 22:37, Hans Hagen wrote:
there are
\showcharacters \showaccents
BTW, I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (I had no time to try them all). Greetings from Hraban!
Henning Hraban Ramm wrote:
On 2005-07-17 at 22:37, Hans Hagen wrote:
there are
\showcharacters \showaccents
BTW, I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (I had no time to try them all).
thanks (\trace... is also handy) Hans
On 2005-07-18 at 00:36, Hans Hagen wrote:
\showcharacters \showaccents BTW I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (had no time to try them all). (\trace... is also handy)
Hm, there's no trace in texshow, but a lot of trace...true in the sources; hopefully I caught them all on http://contextgarden.net/Visual_Debugging. I found some \tracing in the jEdit xml, but nowhere in the sources, and some single \traced... (with d). Greetings from Hraban!
At 09:18 AM 7/18/2005, you wrote:
On 2005-07-18 at 00:36, Hans Hagen wrote:
\showcharacters \showaccents BTW I finally created the wiki page "Visual Debugging" for all the \show... commands; I guess there are even more than I listed there, and some descriptions are still missing (had no time to try them all). (\trace... is also handy)
Hm, there's no trace in texshow, but a lot of trace...true in the sources; hopefully I caught them all on http://contextgarden.net/Visual_Debugging. I found some \tracing in the jEdit xml, but nowhere in the sources, and some single \traced... (with d).
I had also put some lists of all the \show... and \trace... commands I found in the sources on the Discussion page for Visual_Debugging; it may be useful to go through and compare our lists. Note that the \trace... commands seem to be defined with \newif; thus, they come in "\iftrace...", "\trace...true", and "\trace...false" trios. - Brooks
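For readers less familiar with plain TeX internals, the trio Brooks describes comes from a single \newif declaration. A generic illustration (the switch name here is made up, not one of ConTeXt's actual tracers):

```tex
\newif\iftracefoo   % creates \iftracefoo, \tracefootrue and \tracefoofalse
\tracefootrue       % switch it on
\iftracefoo
  \message{tracefoo is enabled}%
\fi
\tracefoofalse      % and off again
```

This is why grepping the sources for "\trace" turns up three related control sequences for every switch.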
Brooks Moses wrote:
I had also put some lists of all the \show... and \trace... commands I found in the sources on the Discussion page for Visual_Debugging; it may be useful to go through and compare our lists.
Note that the \trace... commands seem to be defined with \newif; thus, they come in "\iftrace...", "\trace...true", and "\trace...false" trios.
once we have the lists we can see if some more consistency is needed Hans
participants (8)
-
Brooks Moses
-
Christopher Creutzig
-
Hans Hagen
-
Henning Hraban Ramm
-
Mojca Miklavec
-
Radhelorn
-
Steffen Wolfrum
-
VnPenguin