Let me add one small comment to Martin's: A generic HTTP library
that allows character encodings other than UTF-8 (i.e., does not
enforce UTF-8) needs to be very careful that it does not make
guesses about what non-UTF-8 encodings might mean, e.g., attempt
to translate them to UTF-8 or other forms.

His comment about Windows-1251 is almost certainly correct in this
case. However, assuming that, for a URL whose domain-part is a
subdomain of RU, the coding, if not UTF-8, is necessarily
Windows-1251 would be unreasonable and dangerous: nothing in either
RFC 3986 or externally imposed rules about what the RU TLD can
register requires that non-ASCII characters in URI tails be in
Cyrillic and not Latin or, for that matter, Mongolian, Arabic, or
any other script.

   john


--On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

> For the record, here's what I posted to the relevant github
> issue, for those who aren't subscribed to it:
>
> >>>>
> For a generic HTTP library, not enforcing http/https URLs to
> be UTF-8 is the right decision. But such a library should make
> it easy to use UTF-8 for URIs. And wherever possible, servers
> should use UTF-8 for their URIs if they contain non-ASCII
> characters, and should use a suitable baseXX encoding for
> binary data such as digital signatures and the like.
>
> Btw, contrary to what @brandon93 says at the start of this
> thread,
> https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is
> not in Windows-1252 (Western Europe), but in Windows-1251
> (Russia). This of course makes sense because the site has a
> Russian domain name. The city is Таллин, in Latin
> letters this is Tallin. You can easily check this by using the
> URL in a browser. Using Windows-1252 makes no sense because
> there is no language that contains words like "Òàëëèí"
> (accented vowels only).
>
> This shows the advantage of using UTF-8.
> It avoids the mess of
> regional encodings, and because of its internal structure
> cannot easily be mistaken for some other encoding.
> >>>>
>
> Regards, Martin.
>
> On 2023-01-25 19:54, Raghu Saxena wrote:
>>
>> On 1/25/23 17:47, Julian Reschke wrote:
>>> On 25.01.2023 10:04, Raghu Saxena wrote:
>>>> To whomever it may concern,
>>>>
>>>> I am writing to seek clarification regarding the URI spec
>>>> (RFC3986) followed by HTTP, specifically about
>>>> percent-encoding arbitrary octets (which do not comprise a
>>>> valid UTF-8 sequence). In the last paragraph of RFC3986
>>>> Section 2.5
>>>> (https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
>>>> it says, quote:
>>>>
>>>> > When a new URI scheme defines a component that
>>>> represents textual data consisting of characters
>>>> from the Universal Character Set [UCS],
>>>> the data should first be encoded as octets according
>>>> to the UTF-8 character encoding [STD63]; then only
>>>> those octets that do not correspond to characters in
>>>> the unreserved set should be percent-encoded.
>>>>
>>>> This implies that URI schemes defined after RFC3986 must
>>>> follow UTF-8 encoding in their URIs. However, the original
>>>> HTTP/1.1 RFC (2616) was dated June 1999, and so would not
>>>> have had to "abide" by the UTF-8 rule.
>>>>
>>>> In fact, many web servers allow and process GET requests
>>>> with percent-encoded octets, which they decode as raw bytes
>>>> and have the application level logic handle how to process
>>>> them.
>>>>
>>>> However, since HTTP's latest RFC is 9110, dated June 2022
>>>> (post RFC3986), does it mean the UTF-8 rule now applies to
>>>> it? I would think not, since this would be a breaking
>>>> change. But some comments on github indicate that this is
>>>> as per the spec ()
>>>
>>> Pointer?
>>>
>> My apologies, the comment is here:
>> https://github.com/sindresorhus/got/issues/420#issuecomment-345416645
>>
>>
>>>> tl;dr - Is it compliant with the HTTP specification to send
>>>> arbitrary bytes, which do not represent a valid UTF-8
>>>> sequence, via percent-encoding in the URL query parameter?
>>>
>>> Yes.
>>>
>>> The http scheme was not re-defined by RFCs after RFC 2616
>>> (in fact, it was defined even before that).
>>>
>>> Best regards, Julian
>>>
>> Thanks for the clarification regarding schemes not being
>> re-defined. I will ask the library author to reconsider
>>
>> Regards,
>>
>> Raghu Saxena
>>
>> (P.S. Sorry for the personal reply prior to this - my first
>> time using mailing lists)
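
To make the encoding point in this thread concrete, here is a minimal Python sketch. It is an illustration only, not code from any of the posters or from the library under discussion. It percent-decodes the path segment of the kinopoisk.ru URL quoted in the thread to raw octets and then tries three interpretations of those octets. The helper name `try_decodings` and the list of candidate encodings are assumptions made for the example; only Python standard-library calls are used.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Path segment taken from the URL quoted in the thread.
SEGMENT = "%D2%E0%EB%EB%E8%ED"


def try_decodings(raw, encodings=("utf-8", "cp1251", "cp1252")):
    """Hypothetical helper: decode `raw` under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the octets are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Percent-decoding yields octets, not characters.
raw = unquote_to_bytes(SEGMENT)

# Interpretation of the octets is a separate, application-level step.
decoded = try_decodings(raw)

# A generic library can carry the octets through unchanged.
round_trip = quote_from_bytes(raw)

if __name__ == "__main__":
    print(raw)
    for name, text in decoded.items():
        print(name, "->", text)
    print(round_trip)
```

Both legacy code pages decode these octets without any error, so nothing in the bytes themselves identifies the intended encoding; only the strict UTF-8 decode fails outright. That is consistent with the caution above: a generic library is on safer ground passing the octets through unchanged than guessing an encoding from the TLD.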