Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)

John C Klensin <john-ietf@xxxxxxx> · Mon, 30 Jan 2023 03:04:37 -0500

Let me add one small comment to Martin's:

A generic HTTP library that allows character encodings other
than UTF-8 (i.e., does not enforce UTF-8) needs to be very
careful that it does not make guesses about what non-UTF-8
encodings might mean, e.g., attempt to translate them to UTF-8
or other forms.  His comment about Windows-1251 is almost
certainly correct in this case.  However, assuming that, for a
URL whose domain-part is a subdomain of RU, the coding, if not
UTF-8, is necessarily Windows-1251 would be unreasonable and
dangerous: nothing in either 3986 or externally imposed rules
about what the RU TLD can register requires that non-ASCII
characters in URI tails be in Cyrillic and not Latin or, for
that matter, Mongolian, Arabic, or any other script.
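
As a rough illustration of that point (a sketch only; Python and the
helper name try_decode are assumptions of this note, not anything
used in the thread), a library can keep the percent-decoded octets
as bytes and leave any charset interpretation to the caller. The
same six octets from the URL quoted below are invalid as UTF-8 but
decode without error under more than one legacy code page, so a
successful decode by itself proves nothing:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Path segment copied from the kinopoisk.ru URL quoted in this thread.
segment = "%D2%E0%EB%EB%E8%ED"

# Percent-decode to raw octets; no character encoding is assumed here.
raw = unquote_to_bytes(segment)


def try_decode(data: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The octets are not a valid UTF-8 sequence.
utf8_text = try_decode(raw, "utf-8")

# Both legacy code pages "succeed", with very different results.
cp1251_text = try_decode(raw, "cp1251")
cp1252_text = try_decode(raw, "cp1252")

# Re-encoding the untouched octets reproduces the original segment.
round_trip = quote_from_bytes(raw)
```

Passing raw through unchanged, as in round_trip, is the safe default;
picking between the two successful decodes needs knowledge the
library does not have.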

   john

--On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

> For the record, here's what I posted to the relevant github
> issue, for those who aren't subscribed to it:
> 
>  >>>>
> For a generic HTTP library, not enforcing http/https URLs to
> be UTF-8 is the right decision. But such a library should make
> it easy to use UTF-8 for URIs. And wherever possible, servers
> should use UTF-8 for their URIs if they contain non-ASCII
> characters, and should use a suitable baseXX encoding for
> binary data such as digital signatures and the like.
> 
> Btw, contrary to what @brandon93 says at the start of this
> thread,
> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is
> not in Windows-1252 (Western Europe), but in Windows-1251
> (Russia). This of course makes sense because the site has a
> Russian domain name. The city is Таллин, in Latin
> letters this is Tallin. You can easily check this by using the
> URL in a browser. Using Windows-1252 makes no sense because
> there is no language that contains words like "Òàëëèí"
> (accented vowels only).
> 
> This shows the advantage of using UTF-8. It avoids the mess of
> regional encodings, and because of its internal structure
> cannot easily be mistaken for some other encoding.
>  >>>>
> 
> Regards,   Martin.
> 
> On 2023-01-25 19:54, Raghu Saxena wrote:
>> 
>> On 1/25/23 17:47, Julian Reschke wrote:
>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>> To whomever it may concern,
>>>> 
>>>> I am writing to seek clarification regarding the URI spec
>>>> (RFC3986) followed by HTTP, specifically about
>>>> percent-encoding arbitrary octets (which do not comprise a
>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>> Section 2.5
>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>> it says, quote:
>>>> 
>>>>  >  When a new URI scheme defines a component that represents
>>>>  >  textual data consisting of characters from the Universal
>>>>  >  Character Set [UCS], the data should first be encoded as
>>>>  >  octets according to the UTF-8 character encoding [STD63];
>>>>  >  then only those octets that do not correspond to characters
>>>>  >  in the unreserved set should be percent-encoded.
>>>> 
>>>> This implies that URI schemes defined after RFC3986 must
>>>> follow UTF-8 encoding in their URIs. However, the original
>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>> have had to "abide" by the UTF-8 rule.
>>>> 
>>>> In fact, many web servers allow and process GET requests
>>>> with percent-encoded octets, which they decode as raw bytes
>>>> and have the application level logic handle how to process
>>>> them.
>>>> 
>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>> it? I would think not, since this would be a breaking
>>>> change. But some comments on github indicate that this is
>>>> as per the spec ()
>>> 
>>> Pointer?
>>> 
>> My apologies, the comment is here:
>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>> 
>> 
>>>> tl;dr - Is it compliant with the HTTP specification to send
>>>> arbitrary bytes, which do not represent a valid UTF-8
>>>> sequence, via percent-encoding in the URL query parameter?
>>> 
>>> Yes.
>>> 
>>> The http scheme was not redefined by RFCs after RFC 2616
>>> (in fact, it was defined even before that).
>>> 
>>> Best regards, Julian
>>> 
>> Thanks for the clarification regarding schemes not being
>> redefined. I will ask the library author to reconsider.
>> 
>> Regards,
>> 
>> Raghu Saxena
>> 
>> (P.S. Sorry for the personal reply prior to this - my first
>> time using mailing lists)