There are charts you should be able to find that list code points and the characters they represent. If you know C, I can give you code that demonstrates how to get the code point from a 2-, 3-, or 4-byte UTF-8 sequence. Basically, when you see a byte with its high bit set (a value of 0x80 or above, so not plain ASCII), you count how many leading 1 bits it has. If there are three, for example, the sequence should be a 3-byte UTF-8 sequence. You then check that the second and third bytes each have bit 7 set to 1 and bit 6 set to 0 (in other words, they look like 10xxxxxx). Then you concatenate the payload bits, meaning the bits after the length marker in the first byte plus the low six bits of each following byte, and come up with the code point. Quite complicated!
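
The steps above can be sketched in C roughly as follows. This is only a minimal sketch, not a reference decoder: the function name `utf8_decode`, its signature, and the choice to return 0 on malformed input are my own assumptions for illustration. Beyond the steps described, it also rejects overlong encodings, surrogates, and values above U+10FFFF, which a careful decoder should do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Decodes one UTF-8 sequence starting at s, with n bytes available.
 * On success stores the code point in *cp and returns the number of
 * bytes consumed (1 to 4). Returns 0 if the input is malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: two leading 1 bits */
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: three leading 1 bits */
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: four leading 1 bits */
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                         /* sequence is cut short */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* each following byte must be 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);     /* append its low six bits */
    }

    /* Reject overlong forms, UTF-16 surrogates, and out-of-range values. */
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

For example, given the three bytes E2 82 AC (the euro sign), this returns 3 and stores 0x20AC: the lead byte contributes 0010, and the two following bytes contribute 000010 and 101100.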