Plea for unicode help
Hi there, I'm experiencing a very strange Unicode-related error at the moment and I can't pin down the problem for the life of me…
The situation: I'm using context mkvi (2011.04.20 16:23) on Mac OS 10.6.7 (with the latest font patches in particular) and TeXShop 2.41. My TeX source file contains Unicode letters with accents (as in "clarín") and the file's encoding is set to UTF-8. It still looks like valid UTF-8 when opened with a different text editor, say SubEthaEdit or vim. The problem is that context compiles this to a PDF in which the accent is missing ("clarin").
Now the really strange thing happens: when I delete the offending letter and re-enter it, there's *no* visual difference in TeXShop between before and after, but the compiled PDF suddenly has the accent again.
I seem to remember that the missing accents issue didn't occur with a different font (Latin Modern vs. Minion Pro in context) for the *same* source. I'll have to check that again though.
Has anyone seen this before? I wanted to ask up front before I really start digging into the issue… I might have missed something obvious.
Many thanks, Oliver
On 05/05/2011 01:32 PM, Oliver Buerschaper wrote:
Has anyone seen this before? I wanted to ask up front before I really start digging into the issue… I might have missed something obvious.
Check the hexdump of the file. Chances are that one of them has í directly, and one a combination of <dotlessi><acuteaccent>. Best wishes, Taco
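A minimal sketch of what such a hexdump comparison might reveal, written in Python rather than with a hexdump tool. It assumes the base letter in the combined form is a plain i (U+0069), which is what Unicode canonical decomposition produces; a dotless i (U+0131) would show up as `c4 b1` instead of `69`. The helper name `hex_bytes` is made up for illustration.

```python
# Two spellings of the same visible word "clarín", written with escapes
# so that the editor cannot silently change their form.
precomposed = "clar\u00edn"   # í as the single code point U+00ED
decomposed = "clari\u0301n"   # plain i followed by U+0301 COMBINING ACUTE ACCENT


def hex_bytes(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


print(hex_bytes(precomposed))     # 63 6c 61 72 c3 ad 6e
print(hex_bytes(decomposed))      # 63 6c 61 72 69 cc 81 6e
print(precomposed == decomposed)  # False, although both render alike
```

The two strings look identical in most editors, which would explain why deleting and retyping the letter changes the output without any visible change in the source.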
Awesome hint… hits the nail on the head! The "faulty" version (i.e. the one not appearing in the PDF with Minion Pro) is <dotlessi><acuteaccent> (where <acuteaccent> appears to translate to CC81 in hex, correct?). I guess I need to find and replace the accent combination by the direct slot? Can something similar happen for other "foreign" characters (like ß, umlauts, ae, etc.) or is this sort of error only possible with accents? Oliver
On 05/05/2011 03:52 PM, Oliver Buerschaper wrote:
Awesome hint… hits the nail on the head! The "faulty" version (i.e. the one not appearing in the PDF with Minion Pro) is <dotlessi><acuteaccent> (where <acuteaccent> appears to translate to CC81 in hex, correct?).
Yes. A useful site for finding out stuff like that without having to do the UTF-8 calculations yourself: http://www.decodeunicode.org/en/u+0301/properties At the top right, it has numerical values for the current character in various encodings.
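For the curious, the UTF-8 calculation itself is short. The sketch below handles only the two-byte range (U+0080 to U+07FF), which covers U+0301; the function name `utf8_two_byte` is hypothetical, and Python's built-in encoder is used as a cross-check.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, trail])


manual = utf8_two_byte(0x0301)               # U+0301 COMBINING ACUTE ACCENT
print(manual.hex())                          # cc81
print("\u0301".encode("utf-8").hex())        # cc81, from the built-in encoder
```

This agrees with the CC81 seen in the hexdump.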
I guess I need to find and replace the accent combination by the direct slot?
That would be wise for now, but I think context should be able to trap this automatically (at least in the mode=node case).
Can something similar happen for other "foreign" characters (like ß, umlauts, ae, etc.) or is this sort of error only possible with accents?
IIRC, in principle it can happen with some other characters as well, but I do not think that happens often. It is mostly combining accents. Best wishes, Taco
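The find-and-replace suggested above does not have to be done by hand. One possible approach, outside of context itself, is Unicode normalization to the composed form (NFC), sketched here with Python's standard `unicodedata` module. It assumes the base letter is a plain i; a dotless i (U+0131) followed by U+0301 is not canonically equivalent to í and would be left unchanged. The wrapper name `to_precomposed` is made up for illustration.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Compose base letter + combining mark pairs into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "clari\u0301n"           # i followed by U+0301
fixed = to_precomposed(decomposed)

print(fixed == "clar\u00edn")         # True: now uses the direct slot U+00ED
print(len(decomposed), len(fixed))    # 7 6
```

Running a source file through such a step before compiling would be a workaround only; it does not change how the font or the engine treats combining accents.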
Yes. A useful site for finding out stuff like that without having to do the UTF-8 calculations yourself:
http://www.decodeunicode.org/en/u+0301/properties
At the top right, it has numerical values for the current character in various encodings.
This page looks great. Jotted down for later reading ;-)
I guess I need to find and replace the accent combination by the direct slot?
That would be wise for now, but I think context should be able to trap this automatically (at least in the mode=node case).
Sounds reasonable. By the way, is the direct encoding generally preferred over the combination method (say, by good Unicode practice ;-)? If yes, I certainly wouldn't mind a little warning message if I happen to use the other variant…
Can something similar happen for other "foreign" characters (like ß, umlauts, ae, etc.) or is this sort of error only possible with accents?
IIRC, in principle it can happen with some other characters as well, but I do not think that happens often. It is mostly combining accents.
I see. So umlauts are good candidates to check, too. Thanks again, Oliver
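As a rough way to see which characters are candidates, one can ask whether Unicode canonical decomposition (NFD) splits a character into a base letter plus combining marks. This is a sketch using Python's `unicodedata`; the helper name `has_combining_form` and the sample characters are chosen here for illustration only.

```python
import unicodedata


def has_combining_form(char: str) -> bool:
    """True if NFD splits char into a base letter plus combining mark(s)."""
    return unicodedata.normalize("NFD", char) != char


# í, ü, ñ, ß, æ written as escapes so their form is unambiguous.
for ch in ["\u00ed", "\u00fc", "\u00f1", "\u00df", "\u00e6"]:
    nfd = unicodedata.normalize("NFD", ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(ch, has_combining_form(ch), codes)
```

Accented letters and umlauts have such a two-code-point spelling, while ß and æ have no canonical decomposition, so the latter cannot be affected in this particular way.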