UTF8 encoding is rather simple, really:

  byte number:
  b1          b2          b3          b4
    0 -- 127                                      = unicode 0x00 - 0x7F
  192 -- 223  128 -- 191                          = unicode 0x80 - 0x7FF
  224 -- 239  128 -- 191  128 -- 191              = unicode 0x800 - 0xFFFF
  240 -- 247  128 -- 191  128 -- 191  128 -- 191  = unicode 0x10000 - 0x1FFFFF

(Unicode itself stops at 0x10FFFF, so only part of that last range is
actually assigned.)

There are also sequences for 5 and 6 bytes, but these are illegal for
Unicode representations at the moment:

  248 -- 251  128 -- 191  128 -- 191  128 -- 191  128 -- 191
  252 -- 253  128 -- 191  128 -- 191  128 -- 191  128 -- 191  128 -- 191

128 -- 191 are illegal as first chars in UTF8 (that is handy for
error-recovery).

254 and 255 are completely illegal and should not appear at all (if you
see them, it's a safe bet that the document is encoded as UTF16, not
UTF8).

The unicode number for a UTF8 sequence can be calculated as:

  byte1
      if byte1 <= 127
  (byte1-192)*64 + (byte2-128)
      if 192 <= byte1 <= 223
  (byte1-224)*4096 + (byte2-128)*64 + (byte3-128)
      if 224 <= byte1 <= 239
  (byte1-240)*262144 + (byte2-128)*4096 + (byte3-128)*64 + (byte4-128)
      if 240 <= byte1 <= 247

Simple, eh?

--
Greetings, Taco
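
For anyone who wants to try the calculation, here is a minimal Python sketch of the formulas above. The function name decode_utf8_sequence is made up for illustration; it is not from any library. It only handles the 1 to 4 byte cases and rejects everything else. It does not check for overlong forms, surrogates, or values above 0x10FFFF, so treat it as a demonstration rather than a validating decoder.

```python
def decode_utf8_sequence(data):
    """Decode the UTF-8 sequence at the start of `data` (a bytes object).

    Returns (unicode_number, sequence_length).
    Hypothetical helper for illustration only.
    """
    if not data:
        raise ValueError("empty input")

    b1 = data[0]
    if b1 <= 127:
        length, value = 1, b1
    elif 192 <= b1 <= 223:
        length, value = 2, b1 - 192
    elif 224 <= b1 <= 239:
        length, value = 3, b1 - 224
    elif 240 <= b1 <= 247:
        length, value = 4, b1 - 240
    else:
        # 128..191 is a continuation byte, 248..253 starts a 5/6-byte
        # sequence, and 254/255 never appear in UTF-8.
        raise ValueError("illegal first byte: %d" % b1)

    if len(data) < length:
        raise ValueError("truncated sequence")

    # Each following byte contributes 6 bits: value*64 + (byte - 128).
    # Expanding this loop gives the same sums as the formulas above.
    for b in data[1:length]:
        if not 128 <= b <= 191:
            raise ValueError("illegal continuation byte: %d" % b)
        value = value * 64 + (b - 128)

    return value, length


if __name__ == "__main__":
    # Euro sign, bytes 226 130 172
    number, length = decode_utf8_sequence(b"\xe2\x82\xac")
    print(hex(number), length)
```

Because 128 -- 191 can never start a sequence, a caller that hits a ValueError can skip forward to the next byte outside that range and resume decoding there.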