What is the charset of a URL?

Hi.

I have been wondering for a while how a server application should really treat the "query string" part of a URL, in terms of character encoding. I am talking here about a URL of the form
http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
(the query string being the part after the question mark).

Starting with a quote from
http://www.w3.org/TR/html401/interact/forms.html#h-17.3 :

accept-charset = charset list [CI]
This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received. The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

Some people (myself included), after trying to digest the various RFCs and other recommendations that seem to deal with the subject (e.g. RFC 3986 and the document above), come to the conclusion that the character set and/or encoding of the query string, after percent-decoding, is basically undefined from a server's point of view.
Others seem to be convinced that it is Unicode encoded as UTF-8.
Yet others believe that it is, by default, ISO-8859-1.
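
To make the ambiguity concrete, here is a minimal Python sketch (the sample value "café" is only an illustration, not taken from any real request). It shows that percent-decoding yields only bytes, and that the same bytes give different text depending on which charset the server assumes:

```python
from urllib.parse import quote, unquote_to_bytes

# The same text, percent-encoded by two hypothetical clients.
as_utf8 = quote("café", encoding="utf-8")          # 'caf%C3%A9'
as_latin1 = quote("café", encoding="iso-8859-1")   # 'caf%E9'

# Percent-decoding only gives back bytes; it says nothing about the charset.
raw = unquote_to_bytes(as_utf8)                    # b'caf\xc3\xa9'

# The server has to guess, and both guesses "succeed" with different results.
guess_utf8 = raw.decode("utf-8")                   # 'café'
guess_latin1 = raw.decode("iso-8859-1")            # 'cafÃ©' (mojibake)

# The reverse guess fails outright: a lone 0xE9 is not valid UTF-8.
try:
    unquote_to_bytes(as_latin1).decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

Note that an ISO-8859-1 decode never fails, since every byte maps to some character, so a successful decode proves nothing by itself.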

So which is it?
If I take the above quotation for instance, the part "User agents *may* interpret" (the emphasis is mine) bothers me, in the sense that it implies that the browser can do what it wants anyway. The other part that bothers me is that, according to the above, the "accept-charset" attribute can specify *a list* of character encodings, and not just one. The text then goes on to say "the server is able to accept any single character encoding per entity received". What is an "entity" in this case? Are we talking about the whole form submission, i.e. the whole query string, or about individual data items, i.e. the individual "name=value" pairs?

So basically, what will the browser pick, and how would the server know what it picked?

One could argue that the server should only send forms as follows:
- the server response to the browser should contain a "Content-Type:" header that specifies not only the MIME type "text/html" (or equivalent), but also a "charset" parameter.
- the HTML document being sent should contain a <meta> tag that explicitly provides the document charset/encoding, like
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />.
- the <form> in the document should specify an "accept-charset" attribute, preferably with a single charset/encoding like "utf-8".
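
As a rough illustration of the three points above, here is a Python sketch that builds such a response. The function name build_form_response, the "/search" action and the "q" field are hypothetical, chosen only for the example; this is not tied to any particular server framework.

```python
def build_form_response(charset="utf-8"):
    """Return (headers, body) for a form page that declares its charset
    in the HTTP header, in a <meta> tag, and in accept-charset."""
    html = (
        "<html><head>\n"
        # Second declaration: the <meta> tag inside the document.
        f'<meta http-equiv="Content-Type" content="text/html;charset={charset}" />\n'
        "</head><body>\n"
        # Third declaration: accept-charset on the form itself.
        f'<form action="/search" method="get" accept-charset="{charset}">\n'
        '<input type="text" name="q" />\n'
        '<input type="submit" />\n'
        "</form>\n"
        "</body></html>\n"
    )
    body = html.encode(charset)
    headers = {
        # First declaration: the charset parameter of the Content-Type header.
        "Content-Type": f"text/html; charset={charset}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_form_response()
```

Even with all three declarations in place, the submitted request itself carries no label saying which charset the browser actually used.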

That's all well and good, but:

a) if the incoming URL is something typed by a user in the URL bar of the browser, there is no such previous response sent by the server.
b) HTTP being a stateless protocol, the server should not have any recollection anyway that it previously sent such a form to the same browser (yesterday?), so when a request comes in, the server doesn't know any of the things above for sure.
c) the browser may decide to do whatever it pleases and disregard what the server told it (IE comes to mind, practical examples on request). It would then be in violation of the specifications, but considering the above I'm not so sure it is clear-cut.

For a while now, I have resorted to doing all the things above and, in addition, to always sending forms that specify enctype="multipart/form-data", for which the problem should not exist. I also make sure that each form contains a hidden field holding a string whose content is known to the application, which can be checked for any discrepancy upon form submission (at least between UTF-8 and an ISO-8859 encoding; unfortunately it cannot distinguish between different ISO-8859 encodings).

But that seems like hideous overkill, and it is still not totally foolproof.
(multipart/form-data also has the drawback that it does not play very well with some authentication schemes that use redirects.)
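
The hidden-field check described above could look roughly like the following Python sketch. The field name "enc_probe", the probe value "é" and the function name are assumptions made up for this example; the real application would use its own.

```python
from urllib.parse import unquote_to_bytes

PROBE_NAME = "enc_probe"   # hypothetical hidden field name
PROBE_VALUE = "é"          # known value placed in the hidden field


def guess_query_charset(query_string, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate charset whose encoding of PROBE_VALUE
    matches the bytes received in the probe field, or None."""
    for pair in query_string.split("&"):
        name, _, value = pair.partition("=")
        if name != PROBE_NAME:
            continue
        # Form encoding uses '+' for spaces; undo that before percent-decoding.
        raw = unquote_to_bytes(value.replace("+", " "))
        for charset in candidates:
            if raw == PROBE_VALUE.encode(charset):
                return charset
    return None


guess_query_charset("q=caf%C3%A9&enc_probe=%C3%A9")   # 'utf-8'
guess_query_charset("q=caf%E9&enc_probe=%E9")         # 'iso-8859-1'
```

As noted, a probe like this cannot tell ISO-8859-1 from ISO-8859-15, because "é" is the byte 0xE9 in both.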

It seems to me that the specifications are still not clear and/or not tight enough.

Am I missing something?

(And yes, I know about Punycode, but in my understanding that applies to DNS hostnames, not to query strings.)
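
For what it's worth, the difference can be seen in a few lines of Python (the hostname "bücher.example" is just an illustrative value). The hostname goes through IDNA/Punycode and ends up as pure ASCII, while the query string is percent-encoded bytes in whatever charset the client chose:

```python
from urllib.parse import quote

# Hostname: Python's built-in "idna" codec applies Punycode per label.
host = "bücher.example".encode("idna")             # b'xn--bcher-kva.example'

# Query string: percent-encoding of bytes; the charset is the client's choice.
query_utf8 = quote("bücher", encoding="utf-8")         # 'b%C3%BCcher'
query_latin1 = quote("bücher", encoding="iso-8859-1")  # 'b%FCcher'
```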





---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
  "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx