On 2019-03-17 15:01:40 +0000, Warner, Gary, Jr wrote:
> Many of us have faced character encoding issues because we are not in control
> of our input sources and made the common assumption that UTF-8 covers
> everything.

UTF-8 covers "everything" in the sense that there is a round-trip from
each character in every commonly-used charset/encoding to Unicode and
back.

The actual code may of course be different. For example, the € sign is
0xA4 in iso-8859-15, but U+20AC in Unicode. So you need an
encoding/decoding step.

And "commonly-used" means just that. Unicode covers a lot of character
sets, but it can't cover every character set ever invented (I invented
my own character sets when I was sixteen. Nobody except me ever used
them and they have long succumbed to bit rot).

> In my lab, as an example, some of our social media posts have included ZawGyi
> Burmese character sets rather than Unicode Burmese. (Because Myanmar developed
> technology In a closed to the world environment, they made up their own
> non-standard character set which is very common still in Mobile phones.).

I'd be surprised if there was a character set which is "very common in
Mobile phones", even in a relatively poor country like Myanmar. Does
ZawGyi actually include characters which aren't in Unicode, or are they
just encoded differently?

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@xxxxxx         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
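
To make the encoding/decoding step for the € example concrete, here is a minimal Python sketch. The helper name transcode is made up for illustration; it only wraps the standard bytes.decode and str.encode methods, and it assumes the source encoding is already known, which is exactly what you often don't have with uncontrolled input.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical helper: decode bytes from src, re-encode them as dst."""
    return data.decode(src).encode(dst)


raw = b"\xa4"                     # the euro sign as stored in iso-8859-15

text = raw.decode("iso-8859-15")  # decoding step: bytes -> Unicode string
print(hex(ord(text)))             # code point of the decoded character

utf8 = transcode(raw, "iso-8859-15", "utf-8")
print(utf8)                       # same character, different bytes

# Round trip back to the original single byte.
assert transcode(utf8, "utf-8", "iso-8859-15") == raw

# The same byte means something else under a different charset:
# in iso-8859-1, 0xA4 is the generic currency sign U+00A4, not the euro.
wrong = raw.decode("iso-8859-1")
print(hex(ord(wrong)))
```

Guessing the wrong source charset does not raise an error here; it silently yields a different character, which is why the decoding step needs to know what the input really is.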