Re: what is the charset of a URL ?

Sean Conner <spc@xxxxxxxxxx> · Mon, 9 Feb 2009 20:37:38 -0500

It was thus said that the Great André Warnier once stated:
> Hi.
> 
> Some people (myself included), after trying to digest the various RFCs 
> and other recommendations that seem to deal with the subject (e.g. 
> RFC3986 and the document above), come to the conclusion that the 
> character set and/or encoding of the query string, after 
> percent-decoding, is basically undefined from a server's point of view.
> Others seem to be convinced that it is Unicode encoded as UTF-8.
> Yet others that it is, by default, iso-8859-1.
> 
> Now what is it ?

  Whatever the browser wants, although Firefox may use the character
encoding the page was sent as (and the HTTP spec, rather than the HTML one
if I remember right, makes ISO-8859-1 the default for text content, not
UTF-8).
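
  To make the ambiguity concrete, here is a minimal sketch, assuming
Python and its standard urllib.parse module (neither appears in the original
question). The same text percent-encodes to different bytes depending on the
charset the browser picked, and nothing in the URL says which one that was.

```python
from urllib.parse import quote, unquote

text = "café"

# The same four characters, percent-encoded under two different charsets.
utf8_form = quote(text, encoding="utf-8")         # caf%C3%A9
latin1_form = quote(text, encoding="iso-8859-1")  # caf%E9

# Decoding with the wrong guess does not fail here; it silently yields
# different text, which is why the server cannot simply "try and see".
wrong_guess = unquote(utf8_form, encoding="iso-8859-1")  # cafÃ©

print(utf8_form, latin1_form, wrong_guess)
```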

> If I take the above quotation for instance, the part "User agents *may* 
> interpret " (the emphasis is mine only) kind of bothers me, in the sense 
> that it implies that the browser can do what it wants anyway.
> The other part that bothers me is that according to the above, the 
> "accept-charset" attribute can specify *a list* of character encodings, 
> and not just one.
> Then the above goes on to say "the server is able to accept any single 
> character encoding per entity received". What in this case is an 
> "entity"? Are we talking about the whole form submission, as in the 
> "query string", or are we talking about individual data items, as in the 
> individual "name=value" pairs?

  From playing around with it, it seems to apply to the entire submission,
in that all the name/value pairs are encoded in a single character set.
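
  A rough sketch of what "one charset per submission" means in practice,
again assuming Python's urllib.parse; the helper name parse_submission and
the sample query string are made up for illustration. Names and values are
all decoded with the same single charset, and a wrong choice fails for the
submission as a whole.

```python
from urllib.parse import parse_qsl

def parse_submission(query, charset):
    """Decode every name and value in the query string with one charset."""
    return parse_qsl(query, keep_blank_values=True,
                     encoding=charset, errors="strict")

# Hypothetical submission whose names and values were encoded as ISO-8859-1.
query = "n%E9m=caf%E9&city=Z%FCrich"

pairs = parse_submission(query, "iso-8859-1")

# The same bytes are not valid UTF-8, so a strict UTF-8 parse is rejected.
try:
    parse_submission(query, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(pairs, utf8_ok)
```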

> But that seems like some hideous overkill, and still not totally foolproof.
> (multipart/form-data also has the drawback that it does not play 
> very well with some authentication schemes using redirects)
> 
> It seems to me that the specifications are still not clear and/or not 
> tight enough.
> 
> Am I missing something ?

  I don't think so.  I think I ended up writing a CGI script that assumes
UTF-8 and, if the decoding fails, falls back to ISO-8859-1 and then
Windows-1251 (or some combination; it's around here somewhere).  I used the
GNU iconv library (at our company we use Linux, so it's easy to install and
use) to do the conversions.
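
  The fallback idea looks roughly like the sketch below. This is not the
original CGI script (that one used GNU iconv); it is an illustration in
Python, and the names FALLBACK_CHARSETS and decode_component are invented
for the example. One caveat it makes visible: ISO-8859-1 maps every possible
byte to a character, so in this sketch anything listed after it is never
reached.

```python
from urllib.parse import unquote_to_bytes

# Charsets to try, in order. ISO-8859-1 accepts any byte sequence, so the
# Windows-1251 entry is only reachable if the order is changed.
FALLBACK_CHARSETS = ("utf-8", "iso-8859-1", "windows-1251")

def decode_component(raw, charsets=FALLBACK_CHARSETS):
    """Percent-decode one form field and return (text, charset_used)."""
    # In form submissions a literal '+' stands for a space; a real plus
    # sign arrives as %2B and survives this replacement.
    data = unquote_to_bytes(raw.replace("+", " "))
    for charset in charsets:
        try:
            return data.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate charset could decode the input")

print(decode_component("caf%C3%A9"))  # valid UTF-8
print(decode_component("caf%E9"))     # invalid UTF-8, falls back
```

  Strict UTF-8 decoding is a decent first test because text in the legacy
8-bit charsets rarely happens to form valid UTF-8 sequences, but it is still
a guess, not a guarantee.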

  Messy, but it's about the best you can do.

  -spc (Even with the 'accept-charset' attribute, there may be some
	user-agent out there that doesn't support it, so you're
	still screwed ... )

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.