Martin,

Yes, that was the point. Thanks for further clarifying.

john


--On Tuesday, January 31, 2023 10:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

> On 2023-01-30 17:04, John C Klensin wrote:
>> Let me add one small comment to Martin's:
>>
>> A generic HTTP library that allows character encodings other
>> than UTF-8 (i.e., does not enforce UTF-8) needs to be very
>> careful that it does not make guesses about what non-UTF-8
>> encodings might mean, e.g., attempt to translate them to UTF-8
>> or other forms. His comment about Windows-1251 is almost
>> certainly correct in this case.
>
> Assuming you know the Cyrillic alphabet, or have an
> explanatory document at hand, you can easily confirm that it
> IS correct if you go to the site. But you're right that it's
> only for this specific case.
>
>> However, assuming that, for a
>> URL whose domain-part is a subdomain of RU, the coding, if not
>> UTF-8, is necessarily Windows-1251 would be unreasonable and
>> dangerous: nothing in either 3986 or externally imposed rules
>> about what the RU TLD can register requires that non-ASCII
>> characters in URI tails be in Cyrillic and not Latin or, for
>> that matter, Mongolian, Arabic, or any other script.
>
> Yes of course it could be something else than Cyrillic. Even
> for Cyrillic, it could be KOI-8 or one of its variants, or it
> could be Windows-1251, or it could be ISO 8859-5.
>
> Regards, Martin.
>
>
>> john
>>
>>
>> --On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
>> <duerst@xxxxxxxxxxxxxxx> wrote:
>>
>>> For the record, here's what I posted to the relevant github
>>> issue, for those who aren't subscribed to it:
>>>
>>> >>>>
>>> For a generic HTTP library, not enforcing http/https URLs to
>>> be UTF-8 is the right decision.
>>> But such a library should
>>> make it easy to use UTF-8 for URIs. And wherever possible,
>>> servers should use UTF-8 for their URIs if they contain
>>> non-ASCII characters, and should use a suitable baseXX
>>> encoding for binary data such as digital signatures and the
>>> like.
>>>
>>> Btw, contrary to what @brandon93 says at the start of this
>>> thread,
>>> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/
>>> is not in Windows-1252 (Western Europe), but in Windows-1251
>>> (Russia). This of course makes sense because the site has a
>>> Russian domain name. The city is Таллин, in Latin
>>> letters this is Tallin. You can easily check this by using
>>> the URL in a browser. Using Windows-1252 makes no sense
>>> because there is no language that contains words like
>>> "Òàëëèí" (accented vowels only).
>>>
>>> This shows the advantage of using UTF-8. It avoids the mess
>>> of regional encodings, and because of its internal structure
>>> cannot easily be mistaken for some other encoding.
>>> >>>>
>>>
>>> Regards, Martin.
>>>
>>> On 2023-01-25 19:54, Raghu Saxena wrote:
>>>>
>>>> On 1/25/23 17:47, Julian Reschke wrote:
>>>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>>>> To whomever it may concern,
>>>>>>
>>>>>> I am writing to seek clarification regarding the URI spec
>>>>>> (RFC3986) followed by HTTP, specifically about
>>>>>> percent-encoding arbitrary octets (which do not comprise a
>>>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>>>> Section 2.5
>>>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>>>> it says, quote:
>>>>>>
>>>>>> > When a new URI scheme defines a component that
>>>>>> represents textual data consisting of characters
>>>>>> from the Universal Character Set [UCS],
>>>>>> the data should first be encoded as octets
>>>>>> according to the UTF-8 character encoding
>>>>>> [STD63]; then only those octets that do not
>>>>>> correspond to characters in the unreserved set should be
>>>>>> percent-encoded.
>>>>>>
>>>>>> This implies that URI schemes defined after RFC3986 must
>>>>>> follow UTF-8 encoding in their URIs. However, the original
>>>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>>>> have had to "abide" by the UTF-8 rule.
>>>>>>
>>>>>> In fact, many web servers allow and process GET requests
>>>>>> with percent-encoded octets, which they decode as raw
>>>>>> bytes and have the application level logic handle how to
>>>>>> process them.
>>>>>>
>>>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>>>> it? I would think not, since this would be a breaking
>>>>>> change. But some comments on github indicate that this is
>>>>>> as per the spec ()
>>>>>
>>>>> Pointer?
>>>>>
>>>> My apologies, the comment is here:
>>>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>>>>
>>>>
>>>>>> tl;dr - Is it compliant with the HTTP specification to
>>>>>> send arbitrary bytes, which do not represent a valid UTF-8
>>>>>> sequence, via percent-encoding in the URL query parameter?
>>>>>
>>>>> Yes.
>>>>>
>>>>> The http scheme was not re-defined by RFCs after RFC 2616
>>>>> (in fact, it was defined even before that).
>>>>>
>>>>> Best regards, Julian
>>>>>
>>>> Thanks for the clarification regarding schemes not being
>>>> re-defined. I will ask the library author to reconsider
>>>>
>>>> Regards,
>>>>
>>>> Raghu Saxena
>>>>
>>>> (P.S. Sorry for the personal reply prior to this - my first
>>>> time using mailing lists)
>>
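
The encoding point discussed in this thread can be checked with a short sketch. It assumes Python's standard library (`urllib.parse`) and its built-in codec names (`cp1251`, `cp1252`, `koi8_r`). The helper `try_decode` and the list of candidate encodings are illustrative assumptions, not something taken from the thread. The sketch percent-decodes the path segment of the kinopoisk.ru URL into octets and then attempts several decodings of the same octets.

```python
from typing import Optional
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Path segment from the kinopoisk.ru URL quoted in the thread.
SEGMENT = "%D2%E0%EB%EB%E8%ED"


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Hypothetical helper: return the decoded text, or None if the
    octets are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Percent-decoding yields octets; it says nothing about the charset.
raw = unquote_to_bytes(SEGMENT)

# The candidate encodings are an assumption made for illustration only.
results = {
    enc: try_decode(raw, enc)
    for enc in ("utf-8", "cp1251", "cp1252", "koi8_r")
}

for enc, text in results.items():
    print(f"{enc:8} -> {text!r}")

# The octets survive a round trip without ever being interpreted as text.
assert quote_from_bytes(raw) == SEGMENT
```

Under these assumptions, UTF-8 rejects the octets outright, while every single-byte legacy codec returns some string. Only the `cp1251` result is the intended "Таллин", and `cp1252` gives the "Òàëëèí" mentioned above. Because a legacy decode never fails on these octets, a successful decode is no evidence that the guess was right, which is one reason for a generic library to pass the octets through untouched rather than guess.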