On Wednesday 11 May 2005 07:43, Carl Furst wrote:
> I have a question about an odd phenomenon. It doesn't have much to do with
> PHP except that I used strtr to solve it, and it may be that the problem is
> being caused by a setting in PHP, but I would like to get some more
> background info as to why this is happening.
>
> On a typical Windows system, most applications use the windows-1252
> character set. Linux uses UTF-8 or Unicode. The former being an 8 bit set
> and the latter being a 16 bit set.
>
> Well I have a form on a website that has to be able to take in text from
> MSWord and Notepad and the like. If someone has been using "Autoformatting"
> in MS Word, the "special characters" get translated into a UTF-8
> equivalent. What's odd is that these 8 bit windows characters become 24 bit
> combinations, I think. When I look at the characters in hex they are
> represented by 3 numbers, the first one always being 0xE2.
>
> Why is there an 0xE2 beginning the character combination and why does PHP
> translate these characters this way? Is there something you can do to
> minimize them besides writing some kind of character scrubber?

If you check the Unicode code charts at (http://www.unicode.org/charts/)
and compare the section for Basic Latin with the one for General
Punctuation, you will see the answer to your question. Only Basic Latin
(U+0000 to U+007F) is encoded as a single byte in UTF-8. The curly quotes,
dashes and ellipsis that Word inserts live in General Punctuation (U+2000
to U+206F). UTF-8 encodes every code point from U+0800 to U+FFFF as three
bytes, and for code points U+2000 to U+2FFF the first of those bytes is
always 0xE2.

> Thanks,
> Carl

--
Cyberly yours,
Petar Nedyalkov
Devoted Orbitel Fan :-)

PGP ID: 7AE45436
PGP Public Key: http://bu.orbitel.bg/pgp/bu.asc
PGP Fingerprint: 7923 8D52 B145 02E8 6F63 8BDA 2D3F 7C0B 7AE4 5436
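
To make the 0xE2 prefix concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and it is not the code from the original form handler. It decodes a few Windows-1252 bytes that Word's autoformatting commonly produces, prints their UTF-8 byte sequences, and then applies a small replacement table in the spirit of a strtr() map. The names CP1252_SAMPLES, utf8_hex, ASCII_FALLBACKS and scrub are made up for this example, and the choice of characters and ASCII replacements is an assumption, not an exhaustive list.

```python
# Windows-1252 bytes that Word's autoformatting commonly produces.
# Illustrative selection only; the real input may contain others.
CP1252_SAMPLES = {
    0x85: "ellipsis",
    0x91: "left single quote",
    0x92: "right single quote",
    0x93: "left double quote",
    0x94: "right double quote",
    0x96: "en dash",
    0x97: "em dash",
}


def utf8_hex(byte_value):
    """Decode one Windows-1252 byte and return its UTF-8 bytes as hex."""
    char = bytes([byte_value]).decode("cp1252")
    return " ".join(f"{b:02X}" for b in char.encode("utf-8"))


# Hypothetical replacement table, similar in spirit to a PHP strtr() map.
# Keys are the Unicode characters, values are assumed ASCII stand-ins.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def scrub(text):
    """Replace the punctuation listed above with plain ASCII."""
    return text.translate(ASCII_FALLBACKS)


if __name__ == "__main__":
    # Every sample ends up as three UTF-8 bytes beginning with E2.
    for byte_value, name in CP1252_SAMPLES.items():
        print(f"0x{byte_value:02X} ({name}) -> {utf8_hex(byte_value)}")
    print(scrub("\u201cIt\u2019s fine\u201d \u2013 really\u2026"))
```

Working on decoded characters rather than raw bytes keeps the table readable. If the page and the database are UTF-8 throughout, it may be simpler to store the characters as they arrive and skip the scrubbing step altogether.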