Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)

John C Klensin <john-ietf@xxxxxxx> · Mon, 30 Jan 2023 20:43:06 -0500

Martin,

Yes, that was the point.  Thanks for further clarifying.

   john

--On Tuesday, January 31, 2023 10:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

> On 2023-01-30 17:04, John C Klensin wrote:
>> Let me add one small comment to Martin's:
>> 
>> A generic HTTP library that allows character encodings other
>> than UTF-8 (i.e., does not enforce UTF-8) needs to be very
>> careful that it does not make guesses about what non-UTF-8
>> encodings might mean, e.g., attempt to translate them to UTF-8
>> or other forms.  His comment about Windows-1251 is almost
>> certainly correct in this case.
> 
> Assuming you know the Cyrillic alphabet, or have an
> explanatory document at hand, you can easily confirm that it
> IS correct if you go to the site. But you're right that it's
> only for this specific case.
> 
>> However, assuming that, for a
>> URL whose domain-part is a subdomain of RU, the coding, if not
>> UTF-8, is necessarily Windows-1251 would be unreasonable and
>> dangerous: nothing in either 3986 or externally imposed rules
>> about what the RU TLD can register requires that non-ASCII
>> characters in URI tails be in Cyrillic and not Latin or, for
>> that matter, Mongolian, Arabic, or any other script.
> 
> Yes of course it could be something else than Cyrillic. Even
> for Cyrillic, it could be KOI-8 or one of its variants, or it
> could be Windows-1251, or it could be ISO 8859-5.
> 
> Regards,   Martin.
> 
> 
>>     john
>> 
>>   
>> --On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
>> <duerst@xxxxxxxxxxxxxxx> wrote:
>> 
>>> For the record, here's what I posted to the relevant github
>>> issue, for those who aren't subscribed to it:
>>> 
>>>   >>>> 
>>> For a generic HTTP library, not enforcing http/https URLs to
>>> be UTF-8 is the right decision. But such a library should
>>> make it easy to use UTF-8 for URIs. And wherever possible,
>>> servers should use UTF-8 for their URIs if they contain
>>> non-ASCII characters, and should use a suitable baseXX
>>> encoding for binary data such as digital signatures and the
>>> like.
>>> 
>>> Btw, contrary to what @brandon93 says at the start of this
>>> thread,
>>> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
>>> is not in Windows-1252 (Western Europe), but in Windows-1251
>>> (Russia). This of course makes sense because the site has a
>>> Russian domain name. The city is Таллин, in Latin
>>> letters this is Tallin. You can easily check this by using
>>> the URL in a browser. Using Windows-1252 makes no sense
>>> because there is no language that contains words like
>>> "Òàëëèí" (accented vowels only).
>>> 
>>> This shows the advantage of using UTF-8. It avoids the mess
>>> of regional encodings, and because of its internal structure
>>> cannot easily be mistaken for some other encoding.
>>>   >>>> 
>>> 
>>> Regards,   Martin.
>>> 
>>> On 2023-01-25 19:54, Raghu Saxena wrote:
>>>> 
>>>> On 1/25/23 17:47, Julian Reschke wrote:
>>>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>>>> To whomever it may concern,
>>>>>> 
>>>>>> I am writing to seek clarification regarding the URI spec
>>>>>> (RFC3986) followed by HTTP, specifically about
>>>>>> percent-encoding arbitrary octets (which do not comprise a
>>>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>>>> Section 2.5
>>>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>>>> it says,  quote:
>>>>>> 
>>>>>>   > When a new URI scheme defines a component that
>>>>>>   > represents textual data consisting of characters from
>>>>>>   > the Universal Character Set [UCS], the data should
>>>>>>   > first be encoded as octets according to the UTF-8
>>>>>>   > character encoding [STD63]; then only those octets
>>>>>>   > that do not correspond to characters in the unreserved
>>>>>>   > set should be percent-encoded.
>>>>>> 
>>>>>> This implies that URI schemes defined after RFC3986 must
>>>>>> follow UTF-8 encoding in their URIs. However, the original
>>>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>>>> have had to "abide" by the UTF-8 rule.
>>>>>> 
>>>>>> In fact, many web servers allow and process GET requests
>>>>>> with percent-encoded octets, which they decode as raw
>>>>>> bytes and have the application level logic handle how to
>>>>>> process them.
>>>>>> 
>>>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>>>> it? I would think not, since this would be a breaking
>>>>>> change. But some comments on github indicate that this is
>>>>>> as per the spec ()
>>>>> 
>>>>> Pointer?
>>>>> 
>>>> My apologies, the comment is here:
>>>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>>>> 
>>>> 
>>>>>> tl;dr - Is it compliant with the HTTP specification to
>>>>>> send arbitrary bytes, which do not represent a valid UTF-8
>>>>>> sequence, via percent-encoding in the URL query parameter?
>>>>> 
>>>>> Yes.
>>>>> 
>>>>> The http scheme was not re-defined by RFCs after RFC 2616
>>>>> (in fact, it was defined even before that).
>>>>> 
>>>>> Best regards, Julian
>>>>> 
>>>> Thanks for the clarification regarding schemes not being
>>>> re-defined. I will ask the library author to reconsider.
>>>> 
>>>> Regards,
>>>> 
>>>> Raghu Saxena
>>>> 
>>>> (P.S. Sorry for the personal reply prior to this - my first
>>>> time using mailing lists)
>>
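
The points made in this thread can be illustrated with a short Python
sketch. It is not part of the original messages, and the helper name
decode_candidates and the sample payload are assumptions chosen for
illustration. It uses only the standard urllib.parse module. It shows
(a) that arbitrary octets survive a percent-encoding round trip when
treated as raw bytes, and (b) that the octets from the kinopoisk.ru URL
are rejected by a UTF-8 decoder but are accepted by several single-byte
encodings, so a library cannot tell from the octets alone which one
was intended.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Path segment taken from the URL quoted in the thread.
SEGMENT = "%D2%E0%EB%EB%E8%ED"


def decode_candidates(segment, encodings):
    """Decode a percent-encoded segment under each candidate encoding.

    Hypothetical helper: returns a dict mapping each encoding name to
    the decoded text, or to None when the octets are invalid in it.
    """
    raw = unquote_to_bytes(segment)  # percent-decoding yields octets, not text
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# (a) Arbitrary octets that are not valid UTF-8 round-trip unchanged,
# as long as nothing tries to interpret them as text.
payload = bytes([0xFF, 0xFE, 0x00, 0x41])
encoded = quote_from_bytes(payload)
roundtrip = unquote_to_bytes(encoded)

# (b) The same octets decode "successfully" under several encodings.
candidates = decode_candidates(
    SEGMENT, ["utf-8", "cp1251", "cp1252", "koi8_r", "iso8859_5"]
)

if __name__ == "__main__":
    print(encoded, roundtrip == payload)
    for name, text in candidates.items():
        print(f"{name:10} {text!r}")
```

Only the UTF-8 decoder reports an error here. The Windows-1251 result
is the city name given in the thread and the Windows-1252 result is the
accented-vowel string, but KOI8-R and ISO 8859-5 also produce Cyrillic
text without complaint, which is why guessing is unsafe.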