Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)

Martin J. Dürst <duerst@xxxxxxxxxxxxxxx> · Mon, 30 Jan 2023 13:03:35 +0900

For the record, here's what I posted to the relevant github issue, for 
those who aren't subscribed to it:

>>>>
For a generic HTTP library, not enforcing http/https URLs to be UTF-8 is 
the right decision. But such a library should make it easy to use UTF-8 
for URIs. And wherever possible, servers should use UTF-8 for their URIs 
if they contain non-ASCII characters, and should use a suitable baseXX 
encoding for binary data such as digital signatures and the like.

Btw, contrary to what @brandon93 says at the start of this thread, 
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is not in 
Windows-1252 (Western Europe), but in Windows-1251 (Russia). This of 
course makes sense because the site has a Russian domain name. The city 
is Таллин, in Latin letters this is Tallin. You can easily check this by 
using the URL in a browser. Using Windows-1252 makes no sense because 
there is no language that contains words like "Òàëëèí" (accented vowels 
only).
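
The same check can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names are made up for the example, and `cp1251` / `cp1252` are Python's codec names for Windows-1251 and Windows-1252.

```python
from urllib.parse import unquote_to_bytes

# Percent-encoded path segment taken from the kinopoisk.ru URL above.
segment = "%D2%E0%EB%EB%E8%ED"

# Undo the percent-encoding without assuming any character encoding.
raw = unquote_to_bytes(segment)

# Interpret the same six octets under the two candidate encodings.
as_cp1251 = raw.decode("cp1251")  # Windows-1251 (Cyrillic)
as_cp1252 = raw.decode("cp1252")  # Windows-1252 (Western Europe)

print(as_cp1251)  # Таллин
print(as_cp1252)  # Òàëëèí
```

Both decodes succeed, which is exactly the problem with single-byte regional encodings: the octets alone do not say which one was meant.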

This shows the advantage of using UTF-8. It avoids the mess of regional 
encodings, and because of its internal structure cannot easily be 
mistaken for some other encoding.
>>>>
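
To illustrate the point about UTF-8's internal structure, here is a minimal Python sketch. The helper name `is_valid_utf8` is hypothetical, not from any library. Note that this is a heuristic only: a short legacy-encoded string can, by coincidence, also be valid UTF-8, which is why the claim is "cannot easily" rather than "cannot".

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The Windows-1251 octets from the URL discussed above.
legacy = unquote_to_bytes("%D2%E0%EB%EB%E8%ED")

# The same city name encoded as UTF-8 instead.
utf8 = "Таллин".encode("utf-8")

print(is_valid_utf8(legacy))  # False: 0xD2 is a lead byte, but 0xE0 is not a continuation byte
print(is_valid_utf8(utf8))    # True
```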

Regards,   Martin.

On 2023-01-25 19:54, Raghu Saxena wrote:

On 1/25/23 17:47, Julian Reschke wrote:
On 25.01.2023 10:04, Raghu Saxena wrote:
To whomever it may concern,

I am writing to seek clarification regarding the URI spec (RFC3986)
followed by HTTP, specifically about percent-encoding arbitrary octets
(which do not comprise a valid UTF-8 sequence). In the last paragraph of
RFC3986 Section 2.5
(https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5), it says, 
quote:

 >  When a new URI scheme defines a component that represents textual
    data consisting of characters from the Universal Character Set [UCS],
    the data should first be encoded as octets according to the UTF-8
    character encoding [STD63]; then only those octets that do not
    correspond to characters in the unreserved set should be percent-
    encoded.

This implies that URI schemes defined after RFC3986 must follow UTF-8
encoding in their URIs. However, the original HTTP/1.1 RFC (2616) was
dated June 1999, and so would not have had to "abide" by the UTF-8 rule.

In fact, many web servers allow and process GET requests with
percent-encoded octets, which they decode as raw bytes and have the
application level logic handle how to process them.

However, since HTTP's latest RFC is 9110, dated June 2022 (post
RFC3986), does it mean the UTF-8 rule now applies to it? I would think
not, since this would be a breaking change. But some comments on github
indicate that this is as per the spec ()

Pointer?

My apologies, the comment is here: 
https://github.com/sindresorhus/got/issues/420#issuecomment-345416645

tl;dr - Is it compliant with the HTTP specification to send arbitrary
bytes, which do not represent a valid UTF-8 sequence, via
percent-encoding in the URL query parameter?

Yes.

The http scheme was not re-defined by RFCs after RFC 2616 (in fact, it
was defined even before that).
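
As a rough sketch of what this looks like in practice, the Python standard library can percent-encode and recover arbitrary octets without ever treating them as text. The host `example.com`, the parameter name `q`, and the payload bytes are placeholders chosen for the example; how a given server hands such octets to the application is up to that server.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary octets that are not a valid UTF-8 sequence.
payload = bytes([0xFF, 0xFE, 0x00, 0x80])

# Percent-encode every octet that is not an unreserved character.
encoded = quote_from_bytes(payload, safe="")

# Hypothetical URL carrying the octets in a query parameter.
url = "http://example.com/search?q=" + encoded

# What a receiver can do: decode back to raw bytes, with no charset assumed.
decoded = unquote_to_bytes(encoded)

print(encoded)             # %FF%FE%00%80
print(decoded == payload)  # True
```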

Best regards, Julian

Thanks for the clarification regarding schemes not being re-defined. I 
will ask the library author to reconsider.

Regards,

Raghu Saxena

(P.S. Sorry for the personal reply prior to this - my first time using 
mailing lists)

--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan