Re: URL encoding - percent encoding

Carlos Alarcón wrote:
Hi, I am facing a problem for which I haven't found any help on the Internet (maybe I did the wrong search):
I have: Apache/2.2.3 (CentOS), CentOS 5.
I am trying to access special URLs (containing non-ASCII chars).

So my browser (MZ or IE) converts the URL http://rewrite.tyven/eñe.html into http://rewrite.tyven/e%C3%B1e.html (which is ok, UTF-8 percent encoding following RFC3986). My Apache gets that but seems to interpret it as ISO8859-1, so it tries to retrieve the file: eÃ±e.html

Hi Carlos.
In my opinion, it is not you that is wrong, nor Apache.
And it is not an OS or browser issue.
It is just that the situation is quite confusing.
And it is likely to remain quite confusing until (or if) both the HTTP protocol, and operating systems, adopt Unicode/UTF-8 as the default charset both for URLs and for filesystem directories and filenames.

The full explanation is long, so here is a shortcut :

The question is : how does a user (or browser) specify, in a URL, that
he wants a file called "eñe.html", with the "ñ" meaning a single byte in
the iso-8859-1 encoding ?

In most cases, it should not matter.  The user will rarely input this
URL directly in the browser's URL bar.  He will click on a link, that he
has received previously in a page that came from your server.
It is then the responsibility of the server to make sure that the links
in that page, when clicked on, request the correct resource in the
filesystem or elsewhere.

In practice, it works like this : if in your local filesystem you create
filenames based on Unicode/UTF-8, then in the pages that you send to
your users, you should encode the links in the matching way, so
that when the user clicks on them, he gets the correct file.

If your links are not links into the file system, but things that must
be interpreted by some program at the server end (like when the user
POSTs a form to a cgi-bin script), then it is the responsibility of the
application to make sure that there is consistency between what the
browser is sending and how the application understands it. (There are
several ways to achieve that; unfortunately none of them is really 100%
foolproof at the moment.)
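
To make the application case a bit more concrete, here is a minimal Python sketch (Python is just my choice for illustration, and the field name "name" and the form body are made up). It shows the same percent-encoded form data being understood in two different ways, depending on which charset the application assumes :

```python
from urllib.parse import parse_qs

# Hypothetical form body, as a browser would send it for the value "eñe"
# when the form page was served as UTF-8.
body = "name=e%C3%B1e"

# The application decides how the percent-decoded bytes are interpreted.
as_utf8 = parse_qs(body, encoding="utf-8")
as_latin1 = parse_qs(body, encoding="iso-8859-1")

print(as_utf8)    # the application and the browser agree
print(as_latin1)  # same bytes, but read as two iso-8859-1 characters
```

Nothing in the request body itself tells the application which of the two readings is the right one; it has to know (or assume) what the browser used.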

To put it another way : the browser is on the user side, and the user
can decide what is sent in his URLs (it is always an option in the
browser; the user can click or unclick the option "send URLs in UTF-8").
(It may not even be a browser : think of wget and curl.)

So basically when on the server side you get a HTTP request with a
filepath, you don't know in which character-set/encoding it was sent,
because there is not even a way available in a URL to say that.
If the browser or other client sends a URL that requests "eñe.html"
expressed in UTF-8, and on the server side there is no file with a name
that matches that (byte-by-byte), then the browser should/will get a
404 Not Found.
If the browser or other client sends a URL that requests "eñe.html"
expressed in UTF-8, and on the server side there is a file with a name
that matches that, then the browser will get the requested page.
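
The two cases above can be sketched in a few lines of Python. This is only a toy model of a byte-by-byte lookup, not how Apache is actually implemented; the function name lookup() and the two "directories" are invented for the example :

```python
from urllib.parse import unquote_to_bytes

def lookup(request_path, filenames):
    """Toy model: percent-decode the last path segment to raw bytes and
    return 200 only if a filename matches those bytes exactly."""
    wanted = unquote_to_bytes(request_path.rsplit("/", 1)[-1])
    return 200 if wanted in filenames else 404

# Two hypothetical directories holding "the same" file name,
# stored once as UTF-8 bytes and once as iso-8859-1 bytes.
utf8_dir = {"eñe.html".encode("utf-8")}
latin1_dir = {"eñe.html".encode("iso-8859-1")}

status_utf8 = lookup("/e%C3%B1e.html", utf8_dir)      # bytes match
status_latin1 = lookup("/e%C3%B1e.html", latin1_dir)  # bytes differ
print(status_utf8, status_latin1)
```

In this model, a request for "/e%F1e.html" would find the file in the second directory but not in the first.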


Long explanation :

The filesystem :
As far as I know, Unix/Linux filesystems right now are what can be
called "encoding-neutral".  That means that the filenames for instance,
are composed of "bytes", each byte being just one group of 8 bits, with a byte value 0-255 decimal.

In other words, if a file in the filesystem has a name like (look at
each sign here as an individual byte) "eÃ±e.html", then that filename is
9 bytes long, and is composed of the 9 bytes having the hexadecimal
values corresponding to those "characters"
(try "ls -1 e*.html | od -t x1 -c" as Eric says).
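
If you prefer to check the byte values without touching the filesystem, a small Python sketch (again, just an illustration; the helper hexdump() is my own) prints them for both encodings of the name :

```python
def hexdump(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

name = "eñe.html"
utf8_bytes = name.encode("utf-8")         # ñ becomes the two bytes c3 b1
latin1_bytes = name.encode("iso-8859-1")  # ñ becomes the single byte f1

print(len(utf8_bytes), hexdump(utf8_bytes))
print(len(latin1_bytes), hexdump(latin1_bytes))
```

The UTF-8 form is the 9-byte name discussed here; the iso-8859-1 form is one byte shorter.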

The fact that when you are on the server with a console, and you do a
"ls -l *" in that directory, you see this filename as "eÃ±e.html",
depends on the locale settings of the process which you use to view this
directory. So if you see it as "eÃ±e.html", then most probably your
current locale is iso-8859-1.

If you were to change your current locale to one based on UTF-8 (and
adapt your terminal display accordingly), then (without changing the
filename in the filesystem), you would see that same filename as
"eñe.html".  That is because then, the pair of consecutive bytes which
before looked like "Ã±", would be interpreted as the single Unicode
character "n with tilde", correctly encoded in the UTF-8 encoding.

For the same reason, if your current locale is based on UTF-8, and if
you enter the command : "touch eñe.html", this would create a file whose
name is 9 bytes long (the n tilde occupying 2 bytes), and which in the
UTF-8 encoding would be read as "eñe.html". But then if you changed
your locale to one based on the iso-8859-1 encoding (and your terminal
accordingly), and you did the same "ls -l", you would see this filename
again as "eÃ±e.html".
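
The locale effect described above can be imitated in Python by decoding one fixed byte string in two ways. This is only an analogy for what the terminal and "ls" do, not the same mechanism :

```python
# The 9 bytes stored in the filesystem never change.
raw = b"e\xc3\xb1e.html"

# Only the interpretation changes with the "locale".
seen_latin1 = raw.decode("iso-8859-1")  # 9 characters, shows as eÃ±e.html
seen_utf8 = raw.decode("utf-8")         # 8 characters, shows as eñe.html

print(len(seen_latin1), seen_latin1)
print(len(seen_utf8), seen_utf8)
```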

Conversely, now that your locale is based on iso-8859-1, if you do again
a "touch eñe.html", then you would create a different file, whose name
consists of only 8 bytes (because the n tilde would now be
translated to a single byte in the iso-8859-1 alphabet/encoding).
If you then switch your locale again to a UTF-8 based locale, and do a
"ls -l", you will see correctly the first file you created (whose name
consists of 9 bytes), but you would see the second file you created
probably as "e?e.html". That is because your current locale (UTF-8)
would be trying to interpret that series of 8 bytes, and find out that
the sequence starting at the second byte is not a valid UTF-8
encoding, so it would display this as a "?".
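
The same thing can be seen in Python, as a rough sketch. Note one difference that is specific to Python : with errors="replace" it substitutes the Unicode replacement character U+FFFD rather than the "?" that "ls" typically prints.

```python
# The 8-byte name created under the iso-8859-1 locale.
raw = b"e\xf1e.html"

# A strict UTF-8 decode fails at the second byte (index 1).
try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

# A lenient decode substitutes a placeholder for the invalid byte.
shown = raw.decode("utf-8", errors="replace")

print(error_position)
print(shown)
```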

Confusing, or clear ?

Apache :
URLs are not encoded as UTF-8 by default. The definition of URLs in
RFC 3986 says that (I summarise) any byte that cannot be represented by
one of the ASCII characters A-Za-z0-9 (and a few more chars) should be
"percent-encoded".
But it never says that a URL *is* Unicode. Re-read section 2.5 carefully.
(I am talking of the "file path" part; the DNS hostname part is
different, and is handled in RFC3490 (check PUNYCODE).)
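
As an illustration of the point that percent-encoding works on bytes and carries no charset label, here is a short Python sketch using the standard urllib.parse module (my choice of tool, nothing to do with what Apache itself uses) :

```python
from urllib.parse import quote, unquote_to_bytes

name = "eñe.html"

# The same name gives two different URLs, depending on which bytes
# the client chose before percent-encoding them.
url_utf8 = quote(name, encoding="utf-8")
url_latin1 = quote(name, encoding="iso-8859-1")
print(url_utf8)
print(url_latin1)

# Going the other way gives back bytes only, with no hint of a charset.
decoded = unquote_to_bytes(url_utf8)
print(decoded)
```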

So, basically, it boils down to the fact that when Apache receives a
URL, which when percent-decoded, has a filename at the end that consists
of the 9 bytes "eÃ±e.html", it is not going to translate that any
further, and it is going to look on the filesystem for a file that is
named with those exact same 9 bytes.

In other words, it is not Apache's problem that the user asked for a
file named (in bytes) "eÃ±e.html".  Apache thinks that this is exactly
what the browser wants, and looks for that.
If on the filesystem, the filenames are encoded in iso-8859-1, then
tough luck.



---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
  "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx

