On 2023-01-30 17:04, John C Klensin wrote:
Let me add one small comment to Martin's:
A generic HTTP library that allows character encodings other
than UTF-8 (i.e., does not enforce UTF-8) needs to be very
careful that it does not make guesses about what non-UTF-8
encodings might mean, e.g., attempt to translate them to UTF-8
or other forms. His comment about Windows-1251 is almost
certainly correct in this case.
Assuming you know the Cyrillic alphabet, or have an explanatory document
at hand, you can easily confirm that it IS correct if you go to the
site. But you're right that it's only for this specific case.
However, assuming that, for a
URL whose domain-part is a subdomain of RU, the coding, if not
UTF-8, is necessarily Windows-1251 would be unreasonable and
dangerous: nothing in either 3986 or externally imposed rules
about what the RU TLD can register requires that non-ASCII
characters in URI tails be in Cyrillic and not Latin or, for
that matter, Mongolian, Arabic, or any other script.
Yes, of course it could be something other than Cyrillic. Even for
Cyrillic, it could be KOI-8 or one of its variants, or it could be
Windows-1251, or it could be ISO 8859-5.
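A minimal sketch of why guessing is risky, assuming Python's standard codecs. It decodes the octets from the kinopoisk URL under three legacy Cyrillic encodings. The candidate list is an assumption made for illustration only; real data could use other encodings.

```python
from urllib.parse import unquote_to_bytes

# Percent-encoded path segment from the URL discussed in this thread.
raw = unquote_to_bytes("%D2%E0%EB%EB%E8%ED")

# Candidate encodings, chosen for illustration only.
candidates = ["cp1251", "koi8_r", "iso8859_5"]

# Each of these single-byte codecs assigns a letter to every one of the
# six octets, so every decode succeeds.
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:10} {text}")
```

All three decodes succeed and all three produce Cyrillic letters, but only cp1251 yields the city name. A successful decode therefore says nothing about whether the guess was right.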
Regards, Martin.
john
--On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:
For the record, here's what I posted to the relevant github
issue, for those who aren't subscribed to it:
>>>>
For a generic HTTP library, not enforcing http/https URLs to
be UTF-8 is the right decision. But such a library should make
it easy to use UTF-8 for URIs. And wherever possible, servers
should use UTF-8 for their URIs if they contain non-ASCII
characters, and should use a suitable baseXX encoding for
binary data such as digital signatures and the like.
Btw, contrary to what @brandon93 says at the start of this
thread,
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is
not in Windows-1252 (Western Europe), but in Windows-1251
(Russia). This of course makes sense because the site has a
Russian domain name. The city is Таллин; in Latin
letters this is Tallin. You can easily check this by using the
URL in a browser. Using Windows-1252 makes no sense because
there is no language that contains words like "Òàëëèí"
(accented vowels only).
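The "accented vowels only" observation can be checked with a short sketch, assuming Python's standard library. It decodes the octets as Windows-1252 and strips the diacritics via NFD normalization to show the base letters.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%D2%E0%EB%EB%E8%ED")

# Decoding as Windows-1252 yields the implausible Latin string.
mojibake = raw.decode("cp1252")

# NFD splits each accented letter into base letter + combining mark;
# keeping only the first code point gives the base letter.
bases = "".join(unicodedata.normalize("NFD", ch)[0] for ch in mojibake)

print(mojibake)  # Òàëëèí
print(bases)     # Oaeeei
print(all(b in "aeiouAEIOU" for b in bases))  # True
```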
This shows the advantage of using UTF-8. It avoids the mess of
regional encodings, and because of its internal structure
cannot easily be mistaken for some other encoding.
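A minimal sketch of that structural property, assuming Python's strict UTF-8 decoder. The helper name is_valid_utf8 is hypothetical. The Windows-1251 octets from the URL above are rejected as UTF-8, while the UTF-8 encoding of the same word is accepted.

```python
from urllib.parse import unquote_to_bytes

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

legacy = unquote_to_bytes("%D2%E0%EB%EB%E8%ED")  # Windows-1251 octets
utf8 = "Таллин".encode("utf-8")

print(is_valid_utf8(legacy))  # False: 0xD2 must be followed by a continuation byte
print(is_valid_utf8(utf8))    # True
```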
>>>>
Regards, Martin.
On 2023-01-25 19:54, Raghu Saxena wrote:
On 1/25/23 17:47, Julian Reschke wrote:
On 25.01.2023 10:04, Raghu Saxena wrote:
To whomever it may concern,
I am writing to seek clarification regarding the URI spec
(RFC3986) followed by HTTP, specifically about
percent-encoding arbitrary octets (which do not comprise a
valid UTF-8 sequence). In the last paragraph of RFC3986
Section 2.5
(https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
it says, quote:
> When a new URI scheme defines a component that
represents textual data consisting of characters
from the Universal Character Set [UCS],
the data should first be encoded as octets according
to the UTF-8 character encoding [STD63]; then only
those octets that do not correspond to characters in
the unreserved set should be percent-encoded.
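The quoted rule can be sketched with Python's urllib.parse.quote, which encodes text as UTF-8 and leaves letters, digits, and "-", ".", "_", "~" untouched. Passing safe="" is a choice made here so that "/" is escaped as well; the sample word is taken from later in this thread.

```python
from urllib.parse import quote

# Step 1 (UTF-8 encoding) and step 2 (percent-encoding of everything
# outside the unreserved set) are both performed by quote().
component = quote("Таллин", safe="", encoding="utf-8")
print(component)  # %D0%A2%D0%B0%D0%BB%D0%BB%D0%B8%D0%BD

# Unreserved characters pass through unchanged.
unreserved = quote("a-b_c.d~e", safe="")
print(unreserved)  # a-b_c.d~e
```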
This implies that URI schemes defined after RFC3986 must
follow UTF-8 encoding in their URIs. However, the original
HTTP/1.1 RFC (2616) was dated June 1999, and so would not
have had to "abide" by the UTF-8 rule.
In fact, many web servers allow and process GET requests
with percent-encoded octets, which they decode as raw bytes
and have the application level logic handle how to process
them.
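A rough sketch of that server-side behaviour, assuming Python's standard library. The function raw_query_params, the example URL, and the parameter names are hypothetical. The query is split on "&" and "=", and each part is decoded to raw bytes with no character-set assumption.

```python
from urllib.parse import urlsplit, unquote_to_bytes

def raw_query_params(url: str) -> list[tuple[bytes, bytes]]:
    """Return query parameters as (name, value) pairs of raw bytes."""
    query = urlsplit(url).query
    params = []
    for pair in query.split("&"):
        if not pair:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = pair.partition("=")
        # No charset is assumed; "+" is left as-is in this sketch.
        params.append((unquote_to_bytes(name), unquote_to_bytes(value)))
    return params

params = raw_query_params("http://example.invalid/get?blob=%FF%00%9A&x=1")
print(params)
```

What the bytes mean is then left to the application, as described above.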
However, since HTTP's latest RFC is 9110, dated June 2022
(post RFC3986), does it mean the UTF-8 rule now applies to
it? I would think not, since this would be a breaking
change. But some comments on github indicate that this is
as per the spec ()
Pointer?
My apologies, the comment is here:
https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
tl;dr - Is it compliant with the HTTP specification to send
arbitrary bytes, which do not represent a valid UTF-8
sequence, via percent-encoding in the URL query parameter?
Yes.
The http scheme was not redefined by RFCs after RFC 2616
(in fact, it was defined even before that).
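On the client side, a minimal sketch of sending such bytes, assuming Python's urllib.parse. The payload and the parameter name "blob" are made up for illustration. The octets are not valid UTF-8, yet they percent-encode and round-trip without loss.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary octets that do not form a valid UTF-8 sequence.
payload = b"\xff\x00\x9a"

# safe="" escapes every octet outside the unreserved set, including "/".
encoded = quote_from_bytes(payload, safe="")
query = "blob=" + encoded
print(query)  # blob=%FF%00%9A

# The receiver recovers exactly the same octets.
restored = unquote_to_bytes(encoded)
```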
Best regards, Julian
Thanks for the clarification regarding schemes not being
redefined. I will ask the library author to reconsider.
Regards,
Raghu Saxena
(P.S. Sorry for the personal reply prior to this - my first
time using mailing lists)
--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan