Re: Wrong charset convert

ejirkae@xxxxxxxxx wrote:
This is the problem: http://sgo.happyforever.com/test.php
Try it please, thanks.

------------ Original message ------------
From: <ejirkae@xxxxxxxxx>
Subject: Wrong charset convert
Date: 01.7.2009 00:03:06
---------------------------------------------
I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3. Windows uses the Windows-1250 charset (Czech localization). I want to install the MediaWiki software, which uses the UTF-8 charset.

When I upload a file with non-English characters in its name, the name is saved in UTF-8 format. When I try to open such a file in a web browser, the server returns a 404 Not Found status.

Example:

Upload a file using a simple HTML upload form, which is encoded in UTF-8:

<!-- this is only part of the whole code -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
<body>

<form enctype="multipart/form-data" action="uploader.php" method="POST">
<input type="hidden" name="MAX_FILE_SIZE" value="100000" />
Choose a file to upload: <input name="uploadedfile" type="file" /><br />
<input type="submit" value="Upload File" />

</form>
</body>
</html>

A file named, for example, "složka.png" is saved to the hard drive with the name "sloĹľka.png" in Windows-1250 encoding.
(This is not true, see below)

If the upload form were encoded with charset=Windows-1250, the file would be correctly named "složka.png", but the charset must be UTF-8.

So suppose that we have a server with an uploaded file: http://something.com/složka.png. On Linux it works fine. But on a Windows server you must use an address like http://something.com/sloĹľka.png, and that is not good for MediaWiki.

I don't know if this is understandable enough. I need to set up Apache to ignore the Windows-1250 charset and use the original UTF-8 for decoding the URL.
httpd.conf is original (with php installation).

Thanks for help
Jiri Eichler

Jiri,
the issue you are explaining above is not an easy one.
It will only really be solved when the powers-that-be on the Internet finally decide to move to an HTTP version 2.0 in which everything is, by default, Unicode encoded as UTF-8. Until then, there will be confusion and difficulties for anyone whose main language is not English.

--- Part I -------

First, about your last paragraph :
Apache will not use UTF-8 to decode a URL, because that would be wrong according to the current RFCs that specify how the WWW works.
The "law" in that respect is defined here :
http://www.ietf.org/rfc/rfc2396.txt
See section : 1.5. URI Transcribability

It is all a bit obscure, but basically what it boils down to is this.
When a server receives a URL :
- it first decodes the URL, converting the "percent-escaped" sequences back into single bytes. That means, for instance, that a "%20" is decoded into a space.
- then it does *no further decoding*; it takes the bytes *as they are*.
They are *not supposed* to be decoded any further, using iso-8859-1, cp-1250, UTF-8 or whatever.
(If Apache did that, then Apache would not respect the RFC.)
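
As an illustration only (this is Python, not what Apache runs internally), the percent-decoding step described above can be sketched as follows. The function name decode_request_path is made up for this example; unquote_to_bytes is from the Python standard library. The point is that the result is a sequence of bytes, and nothing in this step says which charset those bytes belong to.

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(raw_path: str) -> bytes:
    """Undo percent-escaping only; the result is raw bytes, not text.

    Hypothetical helper for illustration, not Apache's actual code.
    """
    return unquote_to_bytes(raw_path)


# "%20" becomes a single space byte.
space_example = decode_request_path("/my%20file.html")

# The same visible name, sent by two browsers using different charsets,
# arrives as two different byte sequences.
utf8_example = decode_request_path("/slo%C5%BEka.png")
cp1250_example = decode_request_path("/slo%9Eka.png")

print(space_example)   # b'/my file.html'
print(utf8_example)    # b'/slo\xc5\xbeka.png'
print(cp1250_example)  # b'/slo\x9eka.png'
```

Whether either byte sequence matches a file on disk depends entirely on how that file name was written in the first place.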

Now, let's say that this URL contains a path pointing to some resource, which in this case is a file on disk. Well then, the webserver should take this path exactly as received, and look for a file on disk whose name matches exactly that path, byte by byte.

But between the webserver and the disk, there is an operating system.
The webserver does not read the disk directly. It does that through the I/O interface calls of the OS. So it is possible that when the webserver looks for a file called "xyz123.html", the OS interface translates that to "XYZ123.HTML" for example, and returns /that/ file. That is the case for Windows, for example. For "xyz123.html", Windows will return any file that is named "Xyz123.html", or "xYz123.html", or "XYZ123.html" etc., because when looking for files, Windows is case-insensitive. If the webserver does not double-check this (some do), then it may thus return the wrong file. The same kind of thing can happen with "diacritic" characters, such as the one in your "složka.png".

--------- Part II -----------

Uploading files and writing them to disk.
This is a separate issue.

The script that handles the <form> used to upload the file knows that the filename is Unicode, encoded as UTF-8. (It knows that because you wrote the <form> and the script, and in your <form> you have told the browser to send information in UTF-8.)

In the UTF-8 encoding, the filename "složka.png" consists of *10 characters*, but of *11 bytes*. That is because the "ž" in the middle is encoded using 2 bytes in UTF-8. If you look at this filename with an editor which understands UTF-8, you will see it as "složka.png". If you look at this same filename with an editor which does not understand UTF-8 and is set to Windows-1250, then you will see this same string as "sloĹľka.png" (set to iso-8859-2, it would show as "sloĹžka.png").
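
The character and byte counts are easy to check. The following is a small Python sketch (Python is used here only for illustration; the variable names are arbitrary). The "ž" is written as the escape \u017e so that the example does not depend on how this message itself is encoded.

```python
# "složka.png", with the ž written as an explicit code point.
name = "slo\u017eka.png"
utf8_bytes = name.encode("utf-8")

print(len(name))        # 10 characters
print(len(utf8_bytes))  # 11 bytes: the ž takes two (0xC5 0xBE)

# The same 11 bytes, misread by software that assumes a one-byte charset.
as_cp1250 = utf8_bytes.decode("cp1250")    # "sloĹľka.png"
as_latin2 = utf8_bytes.decode("iso8859_2")  # "sloĹžka.png"
```

Nothing was changed in the bytes themselves; only the interpretation differs.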

But back to your upload script.

It has this uploaded file name, in Unicode UTF-8, as "složka.png".
Now it wants to create this file on disk.
For that, it tells the OS : create file "složka.png".
The OS takes this file name, and depending on several conditions (**), understands this name literally as either a series of *bytes* (11 of them), or as a series of *characters* (10 of them) in UTF-8 encoding. And the OS, according to its understanding, creates a directory entry on disk for this filename. In your case, it creates an entry in the disk directory, containing the /bytes/ (or /characters/) "sloĹľka.png".

It does that, because your script does it wrong :
The script "knows" that this filename is encoded in UTF-8.
But the OS does not know that.
The script /should know/ how the OS is going to understand that, and should, if needed, re-encode this filename in the proper encoding, so that the OS understands it correctly, and creates a file named "složka.png".
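
A minimal sketch of such a re-encoding step, in Python for illustration only (the real upload script here is PHP, and the helper name name_for_filesystem is made up). It assumes the upload arrives as UTF-8 and that the OS interprets file names as Windows-1250, as described for this server; on another system the target encoding would be different.

```python
def name_for_filesystem(uploaded_name: bytes, os_encoding: str = "cp1250") -> bytes:
    """Re-encode an uploaded file name from UTF-8 to the encoding the OS expects.

    Hypothetical helper. Raises UnicodeEncodeError if the name contains a
    character that the target encoding cannot represent.
    """
    text = uploaded_name.decode("utf-8")  # bytes -> characters
    return text.encode(os_encoding)       # characters -> bytes for the OS


uploaded = b"slo\xc5\xbeka.png"           # "složka.png" as sent by the UTF-8 form
on_disk = name_for_filesystem(uploaded)
print(on_disk)                            # b'slo\x9eka.png'
```

Note that this fails for names containing characters that do not exist in the target charset, which is one more reason for the alias approach in Part IV.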

It is not that a file named "sloĹľka.png" is wrong. It is, in itself, a perfectly valid filename.
But the problem is that, considering Part I above :
- your users are going to type a URL in the location bar of their browser
- for that, they are going to use the keyboard that they have, on their workstation, with their OS and their browser etc... (for example, I could never type it, because I don't have a key for "ž" on my keyboard; so I have to cut and paste from your email ;-))
- So they are going to type, for example :
http://yourhost.yourcompany.com/uploadedfiles/složka.png

- The browser is going to URL-encode that, probably replacing the "ž" by a 3-character "percent-sequence" like %9E if it uses Windows-1250, or %BE if it uses iso-8859-2 (or even by two 3-character sequences, %C5%BE, if the browser thinks it must encode the URL as UTF-8).
- the browser is then going to "send this URL" to Apache.
- Apache will receive this URL, decode the %-sequences into *bytes*, and ask the OS for this file.
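
To make the browser side concrete, here is a small Python sketch of the URL-encoding step under two different charset assumptions. Which charset a given browser actually uses is not something this example can tell you; quote is from the Python standard library, the variable names are arbitrary.

```python
from urllib.parse import quote

typed = "slo\u017eka.png"  # what the user types: "složka.png"

# The same typed name, percent-encoded under two charset assumptions.
as_cp1250 = quote(typed, encoding="cp1250")
as_utf8 = quote(typed, encoding="utf-8")

print(as_cp1250)  # slo%9Eka.png
print(as_utf8)    # slo%C5%BEka.png
```

After Apache decodes the %-sequences, the first request asks for a 10-byte name and the second for an 11-byte name, so at most one of them can match the file on disk.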

------ Part III ----

Now, IF the two translations match (the one which happened when you uploaded the file, and the one which happens between the user and the server disk), then the file will be found.
And otherwise, it will not be.

Your case is that the two translations do not match.

----- Part IV : how to resolve this --------

My suggestion :
do /not/ allow the users to decide under which name the file is really stored on the disk. Create an "alias" for the filename, containing only US-ASCII characters, and store the file under that name. And then, arrange that when the users ask for the file "složka.png" (this name appears for example on an index page that you create), in reality your webserver is looking for this alias name. (*)

This is the only way to make your application really portable, because in the end, on the WWW, you never know who or where the user is, what his workstation is, what his OS is, etc.. So the user could upload a file under a name that gives you a lot of trouble on your server (as you have discovered already, but not entirely). For example, one user could upload a file named "složka.png", and another user could upload a file called "Složka.png". If your server is Windows, and if you are not careful, the second file will overwrite the first.
There are many other such problematic cases.
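
As a rough illustration of the overwrite case, the following Python sketch flags uploaded names that differ only by case. The function name find_case_collisions is made up, and str.casefold() is only an approximation of how Windows compares file names, so treat this as a sketch and not as a reliable check.

```python
def find_case_collisions(names):
    """Return pairs of names that differ only by letter case.

    Hypothetical helper; casefold() approximates, but does not exactly
    reproduce, the comparison done by a case-insensitive file system.
    """
    seen = {}
    collisions = []
    for name in names:
        key = name.casefold()
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return collisions


uploads = ["slo\u017eka.png", "Slo\u017eka.png", "other.png"]
clashes = find_case_collisions(uploads)
print(len(clashes))  # 1
```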

And if MediaWiki does not do that, then MediaWiki is not a portable application, sorry. The problem is not the webserver, the problem is the application.

(and, in part, HTTP 1.x)

(*) you show for example an index page like :
<a href="/files/20090630-180667-123456.png">složka.png</a>
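
A minimal sketch of generating such an alias and the matching index link, in Python for illustration only. The helper names make_alias and index_link are made up, and a random UUID is used here in place of the date-based number shown above; any scheme that produces unique US-ASCII names would do.

```python
import html
import uuid


def make_alias(original_name: str) -> str:
    """Return an ASCII-only storage name (hypothetical helper).

    The extension is kept only if it is plain ASCII letters and digits.
    """
    _, dot, ext = original_name.rpartition(".")
    suffix = "." + ext.lower() if dot and ext.isascii() and ext.isalnum() else ""
    return uuid.uuid4().hex + suffix


def index_link(alias: str, display_name: str, base: str = "/files/") -> str:
    """Build the index-page link: ASCII alias in the URL, real name as the label."""
    return '<a href="{}{}">{}</a>'.format(base, alias, html.escape(display_name))


alias = make_alias("slo\u017eka.png")
link = index_link(alias, "slo\u017eka.png")
```

The application has to remember which alias belongs to which original name (in a database table, for example), since the name on disk no longer carries that information.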

(**) which can be, for example, the "locale" under which the Apache process is running.


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.