Last Call: <draft-ietf-appsawg-uri-scheme-reg-04.txt> (Guidelines and Registration Procedures for URI Schemes) to Best Current Practice

"Martin J. Dürst" <duerst@xxxxxxxxxxxxxxx> · Thu, 12 Mar 2015 19:05:17 +0900

Here are some last call comments on 
draft-ietf-appsawg-uri-scheme-reg-04. The review was started a while 
ago, and completed, but the writeup took a lot of time and is still not 
completed, sorry. I may be able to complete it tomorrow, but please 
don't hold your breath.

[Just in case this is necessary, as a process point, I have seen various 
tracker messages (such as the draft being placed on telechat, or that 
the last call has ended), but I'd like to note that the Last Call 
mentions an end date of 2015-03-12, and it's still 2015-03-12 here in 
Japan, which means that this date has barely started in some other 
locations around the world.]

My overall impression is that the direction of the draft is just fine, 
but that the presentation is quite rough in many places and would 
benefit tremendously from more careful wording.

Introduction: Overall, this felt too long, and it would benefit from 
better structuring, and/or moving some of the points out to their own 
sections/subsections. For example, adding subsection titles such as 
"URIs and IRIs" and "Generic Syntax and Scheme Specific Syntax" or some 
such would help quite a lot.

"  o  provide a central point of discovery for established URI scheme
      names, and easy location of defining documents for standard
      schemes;"
The use of the word "standard" in "standard scheme" is unclear. This use 
doesn't appear anywhere else in the document. Do you mean permanently 
registered schemes? If there's some specific point to be made, please 
make it more clearly. Otherwise, I suggest just dropping the word 
"standard".

"o  discourage multiple separate uses of the same scheme name;"
I'd personally be happier if this said "strongly discourage", because I 
hope we all agree that it's really not a good idea. If the consensus is 
that it's obvious anyway that it's really not a good idea, and we don't 
need to be overly clear about that, then I'll keep quiet.

"o  encourage registration by setting a low barrier for registration."
What about making this "encourage early registration"?

"A URI scheme name is the same as the corresponding IRI scheme name."
At the minimum, I'd turn this around and say "An IRI scheme name is the 
same as the corresponding URI scheme name." But because there isn't 
really anything like an "IRI scheme name", I'd actually prefer if this 
said "IRIs use the same scheme names as URIs." or something similar.

"For example, this means that fragment identifiers (#) cannot be re-used 
outside the generic syntax restrictions."
My 'best-guess' interpretation of this sentence is that it is intended 
to say that a scheme definition cannot define fragments that contain 
characters (e.g. #) that RFC 3986 doesn't allow.

But this is bad advice, because scheme definitions cannot say anything 
about fragments at all. This isn't syntax, but semantics; the semantics 
of a fragment are defined by the media type, not the scheme. I haven't 
found any place in this doc that says this; it clearly should be added.

If you want to give an example re. syntax, I'd suggest saying something 
like "For example, the query part cannot contain literal '#' characters 
because they and anything after them would be interpreted as part of the 
fragment and not the query." or some such.
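
To illustrate the point, here is a minimal Python sketch using the 
standard urllib.parse module. The example.com URI and the "q" parameter 
are made up for illustration; they do not come from the draft.

```python
from urllib.parse import urlsplit, quote

# Hypothetical URI whose author meant the query value to be "a#b".
parts = urlsplit("http://example.com/search?q=a#b")
print(parts.query)     # q=a  (the query stops at the first '#')
print(parts.fragment)  # b    (everything after '#' is the fragment)

# Percent-encoding the '#' keeps it inside the query component.
safe = "http://example.com/search?q=" + quote("a#b", safe="")
print(urlsplit(safe).query)  # q=a%23b
```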

Also, the "(#)" in the text is completely superfluous; the '#' itself 
isn't the fragment, and a reader should be able to correlate the word in 
the text and the same word in the ABNF.

"A scheme definition must specify the scheme name and the syntax of the 
scheme-specific part, which is clarified as follows:"
Saying "clarified as follows" and then just giving some ABNF may be 
difficult to grok for some people. I propose to change the sentence to
"A scheme definition must specify the scheme name and the syntax of the 
scheme-specific part, which corresponds to the 'hier-part' and the 
optional query in the above definition. This can be clarified by 
rewriting the definition as follows:"

2. Terminology:

   Within this document, the key words MUST, MAY, SHOULD, REQUIRED,
   RECOMMENDED, and so forth are used within the general meanings
   established in [RFC2119], within the context that they are
   requirements on future registrations.
The double 'within' is confusing. I propose to replace "within the 
context that they are requirements on future registrations" with "as 
requirements on future registrations".

3.  Requirements for Permanent Scheme Definitions

                                                                     For
   IETF Standards-Track documents, Permanent registration status is
   REQUIRED.
Please change this to: "For URI Scheme definitions in IETF 
Standards-Track documents, Permanent registration status is REQUIRED."

3.2. Syntactic Compatibility

                                           Care must be taken to ensure
   that all strings matching their scheme-specific syntax will also
   match the <absolute-URI> grammar described in [RFC3986].

Pronouns like "their" don't usually work well in standard language. I 
suggest changing "their scheme-specific syntax" to "the syntactic 
restrictions of the scheme definition" or some such.

                                                   If there is a strong
   reason for a scheme not to use the hierarchical syntax, then the new
   scheme definition SHOULD follow the syntax of previously registered
   schemes.

Please change "the syntax of previously registered schemes" to "the 
syntax of previously registered schemes with similar components or 
similar syntactic needs." or some such, to make it clear that it's not 
sufficient to just copy some syntax if it's totally unrelated.

   Schemes that are not intended for use with relative URIs SHOULD avoid
   use of the forward slash "/" character, which is used for
   hierarchical delimiters, and the complete path segments "." and ".."
   (dot-segments).
It would be good if the text gave the reasons for the SHOULD (which I 
fully agree with; maybe even a MUST).
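
One reason that could be given, going by RFC 3986 rather than by 
anything stated in the draft: generic URI processors treat "/" as a 
hierarchy delimiter and remove "." and ".." segments during reference 
resolution, whatever a scheme intended them to mean. A small Python 
sketch with urljoin from the standard library, using made-up 
example.com URIs:

```python
from urllib.parse import urljoin

base = "http://example.com/a/b/c"  # hypothetical base URI

# ".." as a complete path segment is consumed by relative resolution.
resolved = urljoin(base, "../d")
print(resolved)  # http://example.com/a/d

# "." is likewise removed rather than kept as data.
print(urljoin(base, "./d"))  # http://example.com/a/b/d

# A "/" in the reference is taken as a hierarchy delimiter.
print(urljoin(base, "x/y"))  # http://example.com/a/b/x/y
```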

Please add a(n informational) reference to Gettys, J., "URI Model 
Consequences", <http://www.w3.org/DesignIssues/ModelConsequences> in 
this section. It is a great text helping designers of URI scheme syntax 
to understand the ideas regarding the different syntax components.

   New schemes SHOULD clearly define the role of [RFC3986] reserved
   characters in URIs of the scheme being defined.
The location of [RFC3986] is a bit strange. It might work if "[RFC3986] 
reserved characters" is taken as a phrase, but it's difficult for the 
reader to see that. Also, the specific topic is discussed in Section 2.2 
of [RFC3986], so change the above to:
   New schemes SHOULD clearly define the role of reserved characters
   (see [RFC3986], Section 2.2) in URIs of the scheme being defined.

3.3. Well-Defined

                                                        and how legal
   values in the base namespace, or legal protocol interactions, might
   be represented in a valid URI.
"might be represented" -> "are represented"

                                   See Section 3.6 for guidelines for
   encoding binary or character strings within valid character sequences
   in a URI .
"binary or character strings" -> "sequences of bytes or characters" (in 
most contexts (programming languages,...), "character string" is 
equivalent to "string", while "binary string" is undefined.)

Superfluous space before period.

               If not all legal values or protocol interactions of the
   base standard can be represented using the scheme, the definition
   SHOULD be clear about which subset are allowed, and why.
"Which subset are" -> "which subset is" (or "which subsets are")

3.5. Context of Use

                  Most commonly, URIs are used as references to
   resources within directories or hypertext documents, as hyperlinks to
   other resources.

This sentence is totally unclear. Why do directories turn up here? Are 
"resources within directories" and "other resources" meant as parallels? 
Are "references" and "hyperlinks" intended to be parallels? Is 
"references to resources within ..." intended to mean "references to 
resources from within ..." or "references to (resources within ...)"? 
Please clarify.

3.6. Internationalization and Character Encoding

   When describing schemes in which (some of) the elements of the URI
   are actually representations of human-readable text, care should be
   taken not to introduce unnecessary variety in the ways in which
   characters are encoded into octets and then into URI characters; see
   [RFC3987] and Section 2.5 of [RFC3986] for guidelines.  If URIs of a
   scheme contain any text fields, the scheme definition MUST describe
   the ways in which characters are encoded and any compatibility issues
   with IRIs of the scheme.

I think it would be extremely helpful to the average URI scheme 
designer/describer if this section mentioned the use of UTF-8. The 
reference to Section 2.5 of RFC 3986 is good, but the problem with that 
section is that it starts out with very general and abstract language, 
and one has to read through the whole section to find the relevant (and 
extremely clear and appropriate) advice in the last paragraph.

At a minimum, please point the reader to the last paragraph of Section 
2.5. Much better would be to include that paragraph verbatim (and say 
so explicitly):

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.  For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".
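
The three examples in that paragraph can be reproduced with a short 
Python sketch. It relies on urllib.parse.quote, which encodes to UTF-8 
by default and leaves unreserved characters as they are; the dictionary 
of expected values is only there for illustration.

```python
from urllib.parse import quote

# Character -> expected percent-encoded form, as given in RFC 3986.
examples = {
    "A": "A",                 # LATIN CAPITAL LETTER A (unreserved)
    "\u00C0": "%C3%80",       # LATIN CAPITAL LETTER A WITH GRAVE
    "\u30A2": "%E3%82%A2",    # KATAKANA LETTER A
}

for ch, expected in examples.items():
    encoded = quote(ch, safe="")  # UTF-8 octets, then percent-encoding
    print(f"U+{ord(ch):04X} -> {encoded}")
    assert encoded == expected
```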

   The scheme specification SHOULD be as restrictive as possible
   regarding what characters are allowed in the URI, because some
   characters can create several different security considerations (see,
   for example [RFC4690]).
I'm afraid that many people will read "as restrictive as possible" as 
"well, let's just do ASCII only" or some such. I believe and hope that 
this wasn't the intent, but I don't think this comes across. One kind of 
improvement would be to change "as restrictive as possible" to just 
"restrictive". Another is to change "as restrictive as possible" to "as 
restrictive as possible without excluding characters outside US-ASCII".

"can create security considerations" sounds weird. The characters may 
create security issues or security problems or some such, which may need 
to be described in a Security Considerations section.

   All percent-encoded variants are automatically included by definition
   for any character given in an IRI production.  This means that if you
   want to restrict the URI percent-encoded forms in some way, you must
   restrict the Unicode forms that would lead to them.

I know what you want to say here (I think it's the point originally 
brought up by Björn Höhrmann in the IRI WG). But I think it's too 
restrictive and can be worded better:

   URI schemes that include textual data from Unicode have to be aware
   that they have to define both the actual characters allowed (for
   IRIs) and the corresponding percent-encoded forms (for URIs and
   IRIs). This can be done in various ways, but in most cases, it is
   advisable to define the actual characters allowed in an IRI
   production, to allow the 'pct-encoded' definition from Section 2.1
   of [RFC3986] at the same places, and to add prose that limits
   percent-escapes to those that can be created by converting valid
   character sequences to percent-encoding via UTF-8.
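
As a rough sketch of what such a prose restriction amounts to in code, 
the following Python function checks that the percent-escapes in a 
component decode to valid UTF-8. The function name is hypothetical, and 
the check covers only UTF-8 validity, not any further per-scheme limits 
on which characters are allowed.

```python
from urllib.parse import unquote_to_bytes


def percent_escapes_are_utf8(component: str) -> bool:
    """Return True if `component` decodes to valid UTF-8 after unescaping.

    Hypothetical helper: a scheme would still have to check the decoded
    characters against its own list of allowed characters.
    """
    try:
        unquote_to_bytes(component).decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(percent_escapes_are_utf8("%E3%82%A2"))  # True: KATAKANA LETTER A
print(percent_escapes_are_utf8("%C3"))        # False: truncated sequence
print(percent_escapes_are_utf8("%FF"))        # False: never valid in UTF-8
```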

Regards,    Martin.