Re: Clarification regarding URI (RFC3986) spec followed by HTTP (RFC9110)

Patrik Fältström <paf=40paftech.se@xxxxxxxxxxxxxx> · Mon, 30 Jan 2023 22:52:24 +1000

If you talk about URIs with non-ASCII characters, also think about bidirectionality, and do not mix up discussion about the order of characters with display order. '/' is a funny character when displaying strings with characters that have different directionality.

Mixing different scripts is hard
paftech.se 

More examples of R2L and L2R characters
paftech.se 

Third example of Bidi issues
paftech.se 

   Patrik

30 jan. 2023 kl. 18:05 skrev John C Klensin <john-ietf@xxxxxxx>:

Let me add one small comment to Martin's:

A generic HTTP library that allows character encodings other
than UTF-8 (i.e., does not enforce UTF-8) needs to be very
careful that it does not make guesses about what non-UTF-8
encodings might mean, e.g., attempt to translate them to UTF-8
or other forms.  His comment about Windows-1251 is almost
certainly correct in this case.  However, assuming that, for a
URL whose domain-part is a subdomain of RU, the coding, if not
UTF-8, is necessarily Windows-1251 would be unreasonable and
dangerous: nothing in either 3986 or externally imposed rules
about what the RU TLD can register requires that non-ASCII
characters in URI tails be in Cyrillic and not Latin or, for
that matter, Mongolian, Arabic, or any other script.
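
A minimal sketch of the "do not guess" behaviour, in Python with only the standard library. The helper name `decode_segment` is hypothetical and not taken from any library. It assumes that a generic library returns text only when the octets are valid UTF-8, and otherwise returns the raw bytes so the application can decide what they mean. The second input is the path segment from the URL under discussion.

```python
from urllib.parse import unquote_to_bytes


def decode_segment(segment: str):
    """Percent-decode one URI segment without guessing a legacy encoding.

    Returns str when the octets are valid UTF-8, otherwise the raw bytes.
    (Hypothetical helper for illustration only.)
    """
    raw = unquote_to_bytes(segment)
    try:
        return raw.decode("utf-8")  # strict decoding: raises on invalid input
    except UnicodeDecodeError:
        return raw  # no guessing; the application interprets the octets


print(decode_segment("%D0%A2%D0%B0%D0%BB%D0%BB%D0%B8%D0%BD"))
print(decode_segment("%D2%E0%EB%EB%E8%ED"))
```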

   john

--On Monday, January 30, 2023 13:03 +0900 "Martin J. Dürst"
<duerst@xxxxxxxxxxxxxxx> wrote:

For the record, here's what I posted to the relevant github
issue, for those who aren't subscribed to it:

For a generic HTTP library, not enforcing http/https URLs to
be UTF-8 is the right decision. But such a library should make
it easy to use UTF-8 for URIs. And wherever possible, servers
should use UTF-8 for their URIs if they contain non-ASCII
characters, and should use a suitable baseXX encoding for
binary data such as digital signatures and the like.

Btw, contrary to what @brandon93 says at the start of this
thread,
https://www.kinopoisk.ru/community/city/%D2%E0%EB%EB%E8%ED/ is
not in Windows-1252 (Western Europe), but in Windows-1251
(Russia). This of course makes sense because the site has a
Russian domain name. The city is Таллин, in Latin
letters this is Tallin. You can easily check this by using the
URL in a browser. Using Windows-1252 makes no sense because
there is no language that contains words like "Òàëëèí"
(accented vowels only).

This shows the advantage of using UTF-8. It avoids the mess of
regional encodings, and because of its internal structure
cannot easily be mistaken for some other encoding.
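
The point can be checked with a few lines of Python (standard library only). This is just an illustrative sketch: it decodes the octets from the URL above under both legacy code pages, then shows that a strict UTF-8 decoder rejects them rather than silently producing some other text. The variable names are arbitrary.

```python
from urllib.parse import unquote_to_bytes

# Octets from the path segment of the kinopoisk.ru URL
raw = unquote_to_bytes("%D2%E0%EB%EB%E8%ED")

as_cp1251 = raw.decode("cp1251")  # Windows-1251 (Cyrillic) reading
as_cp1252 = raw.decode("cp1252")  # Windows-1252 (Western European) reading

try:
    raw.decode("utf-8")  # strict by default
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_cp1251)
print(as_cp1252)
print(valid_utf8)
```

Both legacy decodings succeed, so the bytes alone cannot say which one was intended. The UTF-8 attempt fails, which is the structural property referred to above.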

Regards,   Martin.

On 2023-01-25 19:54, Raghu Saxena wrote:

On 1/25/23 17:47, Julian Reschke wrote:
On 25.01.2023 10:04, Raghu Saxena wrote:
To whomever it may concern,

I am writing to seek clarification regarding the URI spec
(RFC3986) followed by HTTP, specifically about
percent-encoding arbitrary octets (which do not comprise a
valid UTF-8 sequence). In the last paragraph of RFC3986
Section 2.5
(https://www.rfc-editor.org/rfc/rfc3986.html#section-2.5),
it says, quote:

 > When a new URI scheme defines a component that represents textual
 > data consisting of characters from the Universal Character Set [UCS],
 > the data should first be encoded as octets according to the UTF-8
 > character encoding [STD63]; then only those octets that do not
 > correspond to characters in the unreserved set should be
 > percent-encoded.
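
As a rough illustration of that procedure (UTF-8 first, then percent-encode everything outside the unreserved set), here is a sketch using Python's `urllib.parse.quote`, which encodes `str` input as UTF-8 by default. The sample strings are my own choice.

```python
from urllib.parse import quote

# safe="" so that "/" is percent-encoded too; letters, digits and "-._~"
# are never encoded by quote() (Python 3.7 and later for "~").
encoded = quote("Таллин", safe="")
unreserved = quote("a-b_c.d~e", safe="")

print(encoded)
print(unreserved)
```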

This implies that URI schemes defined after RFC3986 must
follow UTF-8 encoding in their URIs. However, the original
HTTP/1.1 RFC (2616) was dated June 1999, and so would not
have had to "abide" by the UTF-8 rule.

In fact, many web servers allow and process GET requests
with percent-encoded octets, which they decode as raw bytes
and have the application level logic handle how to process
them.

However, since HTTP's latest RFC is 9110, dated June 2022
(post RFC3986), does it mean the UTF-8 rule now applies to
it? I would think not, since this would be a breaking
change. But some comments on github indicate that this is
as per the spec ()

Pointer?

My apologies, the comment is here: 
https://github.com/sindresorhus/got/issues/420#issuecomment-345416645

tl;dr - Is it compliant with the HTTP specification to send
arbitrary bytes, which do not represent a valid UTF-8
sequence, via percent-encoding in the URL query parameter?

Yes.

The http scheme was not re-defined by RFCs after RFC 2616
(in fact, it was defined even before that).
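
For illustration, a small Python sketch (standard library only) of arbitrary octets round-tripping through percent-encoding in a query parameter, whether or not they form valid UTF-8. The parameter name `sig` and the byte values are made up for the example.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

payload = bytes([0xFF, 0xFE, 0x00, 0x80])  # not a valid UTF-8 sequence

# Percent-encode every octet outside the unreserved set
query = "sig=" + quote_from_bytes(payload, safe="")

# The receiver gets the same octets back without any charset assumption
recovered = unquote_to_bytes(query.split("=", 1)[1])

print(query)
print(recovered == payload)
```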

Best regards, Julian

Thanks for the clarification regarding schemes not being
re-defined. I will ask the library author to reconsider.

Regards,

Raghu Saxena

(P.S. Sorry for the personal reply prior to this - my first
time using mailing lists)