6 Jun 2011, 10:29 a.m.
On Mon 06 Jun 2011, Arthur Reutenauer wrote:
Well, there *is* more than one way to represent ä in UTF-8
If you mean "non-shortest" forms such as 0xE0 0x83 0xA4 or 0xF0 0x80 0x83 0xA4, then no, they have been forbidden since Unicode 3 in 2000 (formally Corrigendum #1, see http://www.unicode.org/versions/corrigendum1.html).
I was actually thinking of precomposed vs. combining diacritics. I was blissfully unaware of the non-shortest-form problem up until now...

Pont
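For the precomposed vs. combining case, a small sketch using Python's standard unicodedata module may help. The variable names are only illustrative. Both strings display as ä and both are valid UTF-8, yet they are different code point sequences and therefore different bytes; normalization (NFC or NFD) maps one onto the other.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

# Different code point sequences, hence different UTF-8 byte strings.
precomposed_bytes = precomposed.encode("utf-8")   # C3 A4
combining_bytes = combining.encode("utf-8")       # 61 CC 88

# NFC composes, NFD decomposes; after normalizing, the two compare equal.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining
```

So there is only one valid UTF-8 encoding per code point, while "more than one way to represent ä" still holds at the level of code point sequences.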