On Sat, 5 Jan 2008 01:08:13 -0500, tedd wrote:

> At 1:41 AM +0100 1/5/08, Nisse Engström wrote:
>> On Fri, 4 Jan 2008 09:16:54 -0500, tedd wrote:
>>
>>> At 10:33 AM +0100 1/4/08, Nisse Engström wrote:
>>>> On Thu, 3 Jan 2008 12:39:36 -0500, tedd wrote:
>
> Nisse:
>
> Thanks again for your time and guidance.

If that's what you want to call my incoherent ramblings... :-)

> As you said, it's my understanding that a web
> page encoding can be designated via a meta
> statement
>
> <meta http-equiv="content-type" content="text/html;charset=UTF-8">

The page encoding is determined by the HTTP `Content-Type:´
header. Period. A <meta> element may provide hints to a browser
if the HTTP header is missing (e.g. when saving a page to disc).
In the presence of a `Content-Type:´ header, the <meta> element
should be completely ignored.

There have been reports about Russian servers that transcode
documents from one encoding to another (and modify the HTTP
headers accordingly) on the fly, which means that the <meta>
element is incorrect by the time it reaches the browser.

NOTE: Richard Lynch tells us that (some versions of) Internet
Explorer, under certain circumstances, ignores the
`Content-Type:´ header and that a <meta> element is necessary
to make it guess the encoding correctly. I don't know about
this, but I tend to believe him. You should probably do what
Richard says.

> However, that might be different than how the page
> was actually saved.
>
> I have heard of instances where a disconnect like
> that has caused problems with browsers and made
> them kick into quirks mode, which also has
> affected other things like javascript. I had one
> javascript guru that kept hitting me over the
> head with complaints that I was deliberately
> doing it just to piss him off, but the truth was
> I just didn't realize the problem -- still don't.

Things like that can probably happen.
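The precedence rule above can be sketched in a few lines of
Python (the function name and the cp1252 fallback are my own
invention, purely for illustration):

```python
import re

def pick_encoding(http_content_type, meta_content_type, fallback='cp1252'):
    # The HTTP `Content-Type:´ header always wins; the <meta>
    # element is only consulted when no charset arrives over HTTP.
    for source in (http_content_type, meta_content_type):
        if source:
            match = re.search(r'charset=([^\s;"\']+)', source, re.IGNORECASE)
            if match:
                return match.group(1)
    return fallback

# Header present: the <meta> charset is ignored.
print(pick_encoding('text/html; charset=UTF-8',
                    'text/html; charset=ISO-8859-1'))

# Header without a charset: the <meta> charset is used.
print(pick_encoding('text/html',
                    'text/html; charset=ISO-8859-1'))
```

(Real browsers are messier than this, of course -- see the
Internet Explorer caveat above.)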
> So, to cover all bases -- what's the best way to
> set encoding in a web page, to save correctly and
> use a meta tag? And, what do you recommend to be
> the "best" encoding to shoot for, UTF-8?

UTF-8. The only character encodings that seem to be universally
supported (if I recall correctly) are most ISO-8859 (e.g.
`ISO-8859-1´) and Windows (e.g. `cp1252´) encodings, and UTF-8.
The advantage of UTF-8 is, of course, that it covers *a lot*
more characters than any of the legacy 8-bit encodings. I think
that support for UTF-16 (and UTF-32) is lacking, especially
when it comes to form submission.

UTF-8 also has some very nice properties:

* Compatibility with ASCII. Even though it is a variable-
  length encoding, all ASCII characters (U+00 - U+7f) have
  identical encodings in ASCII and UTF-8.

* Compatibility with 8-bit string functions. All multi-
  octet sequences (U+80 and up) contain only octets in the
  range <80> - <ff>, so there are no NULL or control
  characters embedded that can cause problems. Even string
  comparison works (unless you want to go whole hog with
  combining characters, character collation and what-not).

* (And some more stuff that I'm too tired to remember.)

> And lastly, what's the best encoding to set your
> browser? I have clients who are all over the
> place with special windoze characters that appear
> like garbage in my browser.

Set it to detect automatically, with a preference for cp1252
(or windows-1252), which covers a lot of western characters.
cp1252 also has the nice property of being compatible with
ISO-8859-1, except that it has some extra real characters
where 8859-1 has control characters. It seems to me that when
a page is served with an incorrect encoding or none at all,
cp1252 is often the correct encoding. Of course, that may
depend on which pages I tend to visit... This doesn't always
work, of course; regardless of what you choose, you can
probably find plenty of pages where that choice is wrong.
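The UTF-8 properties listed above, and the cp1252/8859-1
relationship, are easy to demonstrate (a small Python sketch;
the sample strings are arbitrary):

```python
# ASCII compatibility: every ASCII character encodes to the same
# single byte in UTF-8 as in plain ASCII.
assert 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')

# Multi-octet sequences use only octets in the range <80> - <ff>,
# so no NULL or ASCII control bytes ever appear inside them.
for ch in 'åäö€':
    assert all(b >= 0x80 for b in ch.encode('utf-8'))

# 8-bit substring search still finds whole characters, because a
# valid UTF-8 sequence can never match starting in the middle of
# another sequence.
assert 'ö'.encode('utf-8') in 'smörgås'.encode('utf-8')

# cp1252 is ISO-8859-1 plus extra real characters in the
# 0x80 - 0x9f range, e.g. the euro sign:
assert '\u20ac'.encode('cp1252') == b'\x80'
assert b'\x80'.decode('iso-8859-1') == '\x80'   # a bare control character
```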
Both Opera and Firefox will let you choose the encoding on the
fly through the menu bar.

- - -

And sometimes you can stumble upon a page where *every* choice
is wrong: <http://www.w3.org/International/>

Look at the various languages under "Links to Translation" in
the yellowish column. Under "ar" (Arabic) you'll find the
following characters (in UTF-8):

  <c3 98><c2 a7><c3 99><c2 84><c3 98><c2 b9><c3 98>
  <c2 b1><c3 98><c2 a8><c3 99><c2 8a><c3 98><c2 a9>

Run this through a `UTF-8 to UTF-16´ converter (to make the
Unicode code points easier to read) and you'll find:

  <00 d8><00 a7><00 d9><00 84><00 d8><00 b9><00 d8>
  <00 b1><00 d8><00 a8><00 d9><00 8a><00 d8><00 a9>

Now, lose the zeroes...

  <d8 a7><d9 84><d8 b9><d8 b1><d8 a8><d9 8a><d8 a9>

...and run it *again* through the `UTF-8 to UTF-16´
converter...

  <06 27><06 44><06 39><06 31><06 28><06 4a><06 29>

...and you end up with some actual Arabic characters! (The
range U+0600 - U+06ff contains Arabic characters.) It's the
same for the other languages. And this on a page about
internationalization, no less. :-)

> I read a book on Unicode and the book provided
> considerable evidence of the complexities of
> encoding. Now throw into the mix PUNYCODE for
> IDNS and you have quite an assortment of problems
> with rendering different code-points in different
> char-sets. A very interesting topic.

Very much so.

[I'm off to work on some 'Last-Modified:' and
'If-Modified-Since:' logic, which is another interesting
topic.]

/Nisse

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
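P.S. The double-encoding round trip described above (decode,
"lose the zeroes", decode again) can be reproduced in a few
lines of Python -- a Latin-1 encode is exactly the "lose the
zeroes" step, since it maps code points U+00 - U+ff straight
to single bytes:

```python
# The raw octets served under "ar", exactly as listed above:
raw = bytes.fromhex('c398c2a7c399c284c398c2b9c398c2b1'
                    'c398c2a8c399c28ac398c2a9')

# First UTF-8 decode: yields U+00d8 U+00a7 ... (mojibake).
once = raw.decode('utf-8')

# "Lose the zeroes": reinterpret those code points as raw bytes
# (that is what a Latin-1 encode does) and decode UTF-8 again.
twice = once.encode('latin-1').decode('utf-8')

# The result is U+0627 U+0644 U+0639 U+0631 U+0628 U+064a U+0629,
# actual Arabic characters.
print([hex(ord(c)) for c in twice])
```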