On Sat, 5 Jan 2008 01:08:13 -0500, tedd wrote:

> At 1:41 AM +0100 1/5/08, Nisse Engström wrote:
>> On Fri, 4 Jan 2008 09:16:54 -0500, tedd wrote:
>>
>>> At 10:33 AM +0100 1/4/08, Nisse Engström wrote:
>>>> On Thu, 3 Jan 2008 12:39:36 -0500, tedd wrote:
>
> Nisse:
>
> Thanks again for your time and guidance.

If that's what you want to call my incoherent ramblings... :-)

> As you said, it's my understanding that a web
> page encoding can be designated via a meta
> statement
>
> <meta http-equiv="content-type" content="text/html;charset=UTF-8">

The page encoding is determined by the HTTP `Content-Type:´
header. Period. A <meta> element may provide hints to a browser
if the HTTP header is missing (e.g. when saving a page to disc).
In the presence of a `Content-Type:´ header, the <meta> element
should be completely ignored.

There have been reports about Russian servers that transcode
documents from one encoding to another (and modify the HTTP
headers accordingly) on the fly, which means that the <meta>
element is incorrect by the time it reaches the browser.

NOTE: Richard Lynch tells us that (some versions of) Internet
Explorer, under certain circumstances, ignores the
`Content-Type:´ header and that a <meta> element is necessary
to make it guess the encoding correctly. I don't know about
this, but I tend to believe him. You should probably do what
Richard says.

> However, that might be different than how the page
> was actually saved.
>
> I have heard of instances where a disconnect like
> that has caused problems with browsers and made
> them kick into quirks mode, which also has
> affected other things like javascript. I had one
> javascript guru that kept hitting me over the
> head with complaints that I was deliberately
> doing it just to piss him off, but the truth was
> I just didn't realize the problem -- still don't.

Things like that can probably happen.
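The precedence rule above can be sketched in a few lines of
Python (the function name and the cp1252 fallback are my own
invention, purely for illustration):

```python
import re

def pick_encoding(http_content_type, meta_content_type, fallback='cp1252'):
    # The HTTP `Content-Type:´ header always wins; the <meta>
    # element is only consulted when no charset arrives over HTTP.
    for source in (http_content_type, meta_content_type):
        if source:
            match = re.search(r'charset=([^\s;"\']+)', source, re.IGNORECASE)
            if match:
                return match.group(1)
    return fallback

# Header present: the <meta> charset is ignored.
print(pick_encoding('text/html; charset=UTF-8',
                    'text/html; charset=ISO-8859-1'))

# Header without a charset: the <meta> charset is used.
print(pick_encoding('text/html',
                    'text/html; charset=ISO-8859-1'))
```

(Real browsers are messier than this, of course -- see the
Internet Explorer caveat above.)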
> So, to cover all bases -- what's the best way to
> set encoding in a web page, to save correctly and
> use a meta tag? And, what do you recommend to be
> the "best" encoding to shoot for, UTF-8?

UTF-8. The only character encodings that seem to be universally
supported (if I recall correctly) are most ISO-8859 (e.g.
`ISO-8859-1´) and Windows (e.g. `cp1252´) encodings, and UTF-8.
The advantage of UTF-8 is, of course, that it covers *a lot*
more characters than any of the legacy 8-bit encodings. I think
that support for UTF-16 (and UTF-32) is lacking, especially
when it comes to form submission.

UTF-8 also has some very nice properties:

* Compatibility with ASCII. Even though it is a variable-
  length encoding, all ASCII characters (U+00 - U+7f) have
  identical encodings in ASCII and UTF-8.

* Compatibility with 8-bit string functions. All multi-
  octet sequences (U+80 and up) contain only octets in the
  range <80> - <ff>, so there are no NULL or control
  characters embedded that can cause problems. Even string
  comparison works (unless you want to go whole hog with
  combining characters, character collation and what-not).

* (And some more stuff that I'm too tired to remember.)

> And lastly, what's the best encoding to set your
> browser? I have clients who are all over the
> place with special windoze characters that appear
> like garbage in my browser.

Set it to detect automatically, with a preference for cp1252
(or windows-1252), which covers a lot of western characters.
cp1252 also has the nice property of being compatible with
ISO-8859-1, except that it has some extra real characters
where 8859-1 has control characters. It seems to me that when
a page is served with an incorrect encoding or none at all,
cp1252 is often the correct encoding. Of course, that may
depend on which pages I tend to visit... This doesn't always
work, of course; regardless of what you choose, you can
probably find plenty of pages where that choice is wrong.
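The UTF-8 properties listed above, and the cp1252/8859-1
relationship, are easy to demonstrate (a small Python sketch;
the sample strings are arbitrary):

```python
# ASCII compatibility: every ASCII character encodes to the same
# single byte in UTF-8 as in plain ASCII.
assert 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')

# Multi-octet sequences use only octets in the range <80> - <ff>,
# so no NULL or ASCII control bytes ever appear inside them.
for ch in 'åäö€':
    assert all(b >= 0x80 for b in ch.encode('utf-8'))

# 8-bit substring search still finds whole characters, because a
# valid UTF-8 sequence can never match starting in the middle of
# another sequence.
assert 'ö'.encode('utf-8') in 'smörgås'.encode('utf-8')

# cp1252 is ISO-8859-1 plus extra real characters in the
# 0x80 - 0x9f range, e.g. the euro sign:
assert '\u20ac'.encode('cp1252') == b'\x80'
assert b'\x80'.decode('iso-8859-1') == '\x80'   # a bare control character
```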
Both Opera and Firefox will let you choose the encoding on the
fly through the menu bar.

- - -

And sometimes you can stumble upon a page where *every* choice
is wrong: <http://www.w3.org/International/>

Look at the various languages under "Links to Translation" in
the yellowish column. Under "ar" (Arabic) you'll find the
following characters (in UTF-8):

  <c3 98><c2 a7><c3 99><c2 84><c3 98><c2 b9><c3 98>
  <c2 b1><c3 98><c2 a8><c3 99><c2 8a><c3 98><c2 a9>

Run this through a `UTF-8 to UTF-16´ converter (to make the
Unicode code points easier to read) and you'll find:

  <00 d8><00 a7><00 d9><00 84><00 d8><00 b9><00 d8>
  <00 b1><00 d8><00 a8><00 d9><00 8a><00 d8><00 a9>

Now, lose the zeroes...

  <d8 a7><d9 84><d8 b9><d8 b1><d8 a8><d9 8a><d8 a9>

...and run it *again* through the `UTF-8 to UTF-16´
converter...

  <06 27><06 44><06 39><06 31><06 28><06 4a><06 29>

...and you end up with some actual Arabic characters! (The
range U+0600 - U+06ff contains Arabic characters.) It's the
same for the other languages. And this on a page about
internationalization, no less. :-)

> I read a book on Unicode and the book provided
> considerable evidence of the complexities of
> encoding. Now throw into the mix PUNYCODE for
> IDNS and you have quite an assortment of problems
> with rendering different code-points in different
> char-sets. A very interesting topic.

Very much so.

[I'm off to work on some 'Last-Modified:' and
'If-Modified-Since:' logic, which is another interesting
topic.]

/Nisse

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
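P.S. The double-encoding round trip described above (decode,
"lose the zeroes", decode again) can be reproduced in a few
lines of Python -- a Latin-1 encode is exactly the "lose the
zeroes" step, since it maps code points U+00 - U+ff straight
to single bytes:

```python
# The raw octets served under "ar", exactly as listed above:
raw = bytes.fromhex('c398c2a7c399c284c398c2b9c398c2b1'
                    'c398c2a8c399c28ac398c2a9')

# First UTF-8 decode: yields U+00d8 U+00a7 ... (mojibake).
once = raw.decode('utf-8')

# "Lose the zeroes": reinterpret those code points as raw bytes
# (that is what a Latin-1 encode does) and decode UTF-8 again.
twice = once.encode('latin-1').decode('utf-8')

# The result is U+0627 U+0644 U+0639 U+0631 U+0628 U+064a U+0629,
# actual Arabic characters.
print([hex(ord(c)) for c in twice])
```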