If you talk about URIs with non-ASCII characters, also think about bidirectionality, and do not mix up discussion of the order of characters with display order. '/' is a funny character when displaying strings whose characters have different directionality.
Patrik
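A minimal sketch of the point about '/', assuming a hypothetical path with one Latin and one Hebrew segment. It only reports the Unicode bidirectional class of each character in logical (stored) order; it does not run the bidirectional algorithm, so it says nothing definitive about how a given renderer will lay the string out.

```python
import unicodedata

# Hypothetical example path: a Latin segment, a slash, a Hebrew segment.
# The string is held in logical order, which is what a URI carries.
logical = "abc/\u05d0\u05d1\u05d2"

# Collect the bidirectional class of every character in logical order.
bidi_classes = [unicodedata.bidirectional(ch) for ch in logical]

for ch, cls in zip(logical, bidi_classes):
    print(f"U+{ord(ch):04X} {cls}")
```

The slash is neither left-to-right nor right-to-left in its own right, so its displayed position depends on its neighbours, while the logical order stays the same.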
On 30 Jan 2023, at 18:05, John C Klensin <john-ietf@xxxxxxx> wrote:
Let me add one small comment to Martin's:

A generic HTTP library that allows character encodings other
than UTF-8 (i.e., does not enforce UTF-8) needs to be very
careful that it does not make guesses about what non-UTF-8
encodings might mean, e.g., attempt to translate them to UTF-8
or other forms. His comment about Windows-1251 is almost
certainly correct in this case. However, assuming that, for a
URL whose domain-part is a subdomain of RU, the coding, if not
UTF-8, is necessarily Windows-1251 would be unreasonable and
dangerous: nothing in either 3986 or externally imposed rules
about what the RU TLD can register requires that non-ASCII
characters in URI tails be in Cyrillic and not Latin or, for
that matter, Mongolian, Arabic, or any other script.

john

--On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

For the record, here's what I posted to the relevant github
issue, for those who aren't subscribed to it:
For a generic HTTP library, not enforcing http/https URLs to
be UTF-8 is the right decision. But such a library should make
it easy to use UTF-8 for URIs. And wherever possible, servers
should use UTF-8 for their URIs if they contain non-ASCII
characters, and should use a suitable baseXX encoding for
binary data such as digital signatures and the like.
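As one possible reading of "a suitable baseXX encoding", here is a sketch using the URL-safe base64 alphabet from Python's standard library. The byte string is a made-up stand-in for a signature, and stripping the "=" padding is a choice made for this illustration, not something the thread prescribes.

```python
import base64

# Made-up stand-in for binary data such as a digital signature.
signature = bytes.fromhex("fbff00d2e0eb")

# URL-safe base64 uses only letters, digits, '-' and '_'.
# Padding '=' is stripped here so the token needs no percent-encoding.
token = base64.urlsafe_b64encode(signature).rstrip(b"=").decode("ascii")

# The receiver restores the padding before decoding.
padded = token + "=" * (-len(token) % 4)
decoded = base64.urlsafe_b64decode(padded)

print(token)
```

The resulting token consists only of characters from the RFC 3986 unreserved set, so no question of character encoding arises for it.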
Btw, contrary to what @brandon93 says at the start of this
thread,
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is
not in Windows-1252 (Western Europe), but in Windows-1251
(Russia). This of course makes sense because the site has a
Russian domain name. The city is Таллин, in Latin
letters this is Tallin. You can easily check this by using the
URL in a browser. Using Windows-1252 makes no sense because
there is no language that contains words like "Òàëëèí"
(accented vowels only).
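The check can also be done outside a browser. A small sketch with Python's standard library, decoding the percent-encoded octets from the URL above under both code pages; both decodings succeed technically, so the code itself cannot tell which one was intended:

```python
from urllib.parse import unquote_to_bytes

# Last path segment of the kinopoisk.ru URL quoted above.
path_segment = "%D2%E0%EB%EB%E8%ED"

# Percent-decoding yields raw octets, not characters.
raw = unquote_to_bytes(path_segment)

# The same six octets read under two different legacy code pages.
as_cp1251 = raw.decode("cp1251")
as_cp1252 = raw.decode("cp1252")

print(raw.hex())
print(as_cp1251)
print(as_cp1252)
```

Only a human reader, or external knowledge about the site, decides that the Windows-1251 reading is the plausible one.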
This shows the advantage of using UTF-8. It avoids the mess of
regional encodings, and because of its internal structure
cannot easily be mistaken for some other encoding.
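To illustrate the point about internal structure, a sketch assuming a hypothetical helper `is_valid_utf8`: strict UTF-8 decoding rejects the Windows-1251 octets from the URL above, while the UTF-8 encoding of the same city name passes. A successful check is strong evidence of UTF-8, not proof.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# Octets from the URL above (Windows-1251 encoding of the city name).
legacy = bytes.fromhex("d2e0ebebe8ed")

# The same name encoded as UTF-8.
utf8 = "\u0422\u0430\u043b\u043b\u0438\u043d".encode("utf-8")

print(is_valid_utf8(legacy))
print(is_valid_utf8(utf8))
```

The legacy octets fail because 0xD2 announces a two-octet sequence and the following 0xE0 is not a continuation octet.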
Regards, Martin.
On 2023-01-25 19:54, Raghu Saxena wrote:
On 1/25/23 17:47, Julian Reschke wrote:
On 25.01.2023 10:04, Raghu Saxena wrote:
To whomever it may concern,
I am writing to seek clarification regarding the URI spec
(RFC3986) followed by HTTP, specifically about
percent-encoding arbitrary octets (which do not comprise a
valid UTF-8 sequence). In the last paragraph of RFC3986
Section 2.5
(https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
it says, quote:
> When a new URI scheme defines a component that
represents textual data consisting of characters
from the Universal Character Set [UCS],
the data should first be encoded as octets according
to the UTF-8 character encoding [STD63]; then only
those octets that do not correspond to characters in
the unreserved set should be percent-encoded.
This implies that URI schemes defined after RFC3986 must
follow UTF-8 encoding in their URIs. However, the original
HTTP/1.1 RFC (2616) was dated June 1999, and so would not
have had to "abide" by the UTF-8 rule.
In fact, many web servers allow and process GET requests
with percent-encoded octets, which they decode as raw bytes
and have the application level logic handle how to process
them.
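A sketch of that round trip using Python's standard library. The payload and the query parameter name `data` are made up for illustration; the octets are deliberately not valid UTF-8.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Made-up payload: 0xFF and 0xFE can never appear in well-formed UTF-8.
payload = bytes([0xFF, 0xFE, 0x00, 0x41])

# Client side: percent-encode every octet outside the unreserved set.
encoded = quote_from_bytes(payload, safe="")
query = "data=" + encoded  # hypothetical parameter name

# Server side: split off the value and decode it back to raw octets,
# leaving interpretation to application-level logic.
value = query.split("=", 1)[1]
recovered = unquote_to_bytes(value)

print(query)
```

No character encoding is assumed at any point; the octets arrive exactly as sent.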
However, since HTTP's latest RFC is 9110, dated June 2022
(post RFC3986), does it mean the UTF-8 rule now applies to
it? I would think not, since this would be a breaking
change. But some comments on github indicate that this is
as per the spec ()
Pointer?
My apologies, the comment is here:
https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
tl;dr - Is it compliant with the HTTP specification to send
arbitrary bytes, which do not represent a valid UTF-8
sequence, via percent-encoding in the URL query parameter?
Yes.
The http scheme was not re-defined by RFCs after RFC 2616
(in fact, it was defined even before that).
Best regards, Julian
Thanks for the clarification regarding schemes not being
re-defined. I will ask the library author to reconsider.
Regards,
Raghu Saxena
(P.S. Sorry for the personal reply prior to this - my first
time using mailing lists)