Here are some last call comments on
draft-ietf-appsawg-uri-scheme-reg-04. The review was started a while
ago, and completed, but the writeup took a lot of time and is still not
completed, sorry. I may be able to complete it tomorrow, but please
don't hold your breath.
[Just in case this is necessary, as a process point, I have seen various
tracker messages (such as the draft being placed on telechat, or that
the last call has ended), but I'd like to note that the Last Call
mentions an end date of 2015-03-12, and it's still 2015-03-12 here in
Japan, which means that this date has barely started in some other
locations around the world.]
My overall impression is that the direction of the draft is just
fine, but that presentation and wording are quite rough in many places
and would benefit tremendously from more careful editing.
Introduction: Overall, this felt too long, and it would benefit from
better structuring, and/or moving some of the points out to their own
sections/subsections. For example, adding subsection titles such as
"URIs and IRIs" and "Generic Syntax and Scheme Specific Syntax" or some
such would help quite a lot.
" o provide a central point of discovery for established URI scheme
names, and easy location of defining documents for standard
schemes;"
The use of the word "standard" in "standard schemes" is unclear. This use
doesn't appear anywhere else in the document. Do you mean permanently
registered schemes? If there's some specific point to be made, please
make it more clearly. Otherwise, I suggest just dropping the word "standard".
"o discourage multiple separate uses of the same scheme name;"
I'd personally be happier if this said "strongly discourage", because I
hope we all agree that it's really not a good idea. If the consensus is
that it's obvious anyway that it's really not a good idea, and we don't
need to be overly clear about that, then I'll keep quiet.
"o encourage registration by setting a low barrier for registration."
What about making this "encourage early registration"?
"A URI scheme name is the same as the corresponding IRI scheme name."
At the minimum, I'd turn this around and say "An IRI scheme name is the
same as the corresponding URI scheme name." But because there isn't
really anything like an "IRI scheme name", I'd actually prefer if this
said "IRIs use the same scheme names as URIs." or something similar.
"For example, this means that fragment identifiers (#) cannot be re-used
outside the generic syntax restrictions."
My 'best-guess' interpretation of this sentence is that it is intended to
say that a scheme definition cannot define fragments that contain
characters (e.g. #) that RFC 3986 doesn't allow.
But this is bad advice, because scheme definitions cannot say anything
about fragments at all. This isn't syntax, but semantics; the semantics
of a fragment are defined by the media type, not the scheme. I haven't
found any place in this doc that says this; it clearly should
be added.
If you want to give an example regarding syntax, I'd suggest saying something
like "For example, the query part cannot contain literal '#' characters
because they and anything after them would be interpreted as part of the
fragment and not the query." or some such.
Also, the "(#)" in the text is completely superfluous; the '#' itself
isn't the fragment, and a reader should be able to correlate the word in
the text and the same word in the ABNF.
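The query example suggested above can be checked with a generic parser. The following is a minimal sketch using Python's standard urllib.parse module; the example.com URIs and the "tag" parameter are made up for illustration and do not come from the draft.

```python
from urllib.parse import urlsplit

# Hypothetical URI whose author meant "a#b" to be the value of "tag".
parts = urlsplit("http://example.com/search?tag=a#b")

# A generic parser ends the query at the first '#'.
print(parts.query)
print(parts.fragment)

# To carry a literal '#' inside the query, it has to be percent-encoded.
escaped = urlsplit("http://example.com/search?tag=a%23b")
print(escaped.query)
print(escaped.fragment)
```

In the first case the parser reports "tag=a" as the query and "b" as the fragment; in the second, the whole of "tag=a%23b" stays in the query and the fragment is empty.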
"A scheme definition must specify the scheme name and the syntax of the
scheme-specific part, which is clarified as follows:"
Saying "clarified as follows" and then just giving some ABNF may be
difficult to grok for some people. I propose to change the sentence to
"A scheme definition must specify the scheme name and the syntax of the
scheme-specific part, which corresponds to the 'hier-part' and the
optional query in the above definition. This can be clarified by
rewriting the definition as follows:"
2. Terminology:
Within this document, the key words MUST, MAY, SHOULD, REQUIRED,
RECOMMENDED, and so forth are used within the general meanings
established in [RFC2119], within the context that they are
requirements on future registrations.
The double 'within' is confusing. I propose to replace "within the
context that they are requirements on future registrations" with "as
requirements on future registrations".
3. Requirements for Permanent Scheme Definitions
For
IETF Standards-Track documents, Permanent registration status is
REQUIRED.
Please change this to: "For URI Scheme definitions in IETF
Standards-Track documents, Permanent registration status is REQUIRED."
3.2. Syntactic Compatibility
Care must be taken to ensure
that all strings matching their scheme-specific syntax will also
match the <absolute-URI> grammar described in [RFC3986].
Pronouns like "their" don't usually work well in standard language. I
suggest changing "their scheme-specific syntax" to "the syntactic
restrictions of the scheme definition" or some such.
If there is a strong
reason for a scheme not to use the hierarchical syntax, then the new
scheme definition SHOULD follow the syntax of previously registered
schemes.
Please change "the syntax of previously registered schemes" to "the
syntax of previously registered schemes with similar components or
similar syntactic needs." or some such, to make it clear that it's not
sufficient to just copy some syntax if it's totally unrelated.
Schemes that are not intended for use with relative URIs SHOULD avoid
use of the forward slash "/" character, which is used for
hierarchical delimiters, and the complete path segments "." and ".."
(dot-segments).
It would be good if the text gave the reasons for the SHOULD (which I
fully agree with; maybe even a MUST).
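One reason that could be given, sketched here as an assumption rather than as the draft's own rationale: generic URI software applies the relative resolution algorithm of [RFC3986], Section 5.2, to "/" and to dot-segments without knowing whether the scheme intended a hierarchy. The snippet below uses Python's standard urllib.parse module with a hypothetical example.com base URI to show what that processing does.

```python
from urllib.parse import urljoin

# Hypothetical base URI; any hierarchical URI would behave the same way.
base = "http://example.com/a/b/c"

# ".." removes the preceding path segment during resolution.
resolved = urljoin(base, "../d")
print(resolved)

# "." refers to the current hierarchy level and is removed as well.
same_dir = urljoin(base, "./e")
print(same_dir)
```

A scheme that uses "/", "." or ".." for some other purpose risks having its URIs silently rewritten in this way by generic processors.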
Please add a(n informational) reference to Gettys, J., "URI Model
Consequences", <http://www.w3.org/DesignIssues/ModelConsequences> in
this section. It is a great text helping designers of URI scheme syntax
to understand the ideas regarding the different syntax components.
New schemes SHOULD clearly define the role of [RFC3986] reserved
characters in URIs of the scheme being defined.
The location of [RFC3986] is a bit strange. It might work if "[RFC3986]
reserved characters" is taken as a phrase, but it's difficult for the
reader to see that. Also, the specific topic is discussed in Section 2.2
of [RFC3986], so change the above to:
New schemes SHOULD clearly define the role of reserved characters
(see [RFC3986], Section 2.2) in URIs of the scheme being defined.
3.3. Well-Defined
and how legal
values in the base namespace, or legal protocol interactions, might
be represented in a valid URI.
"might be represented" -> "are represented"
See Section 3.6 for guidelines for
encoding binary or character strings within valid character sequences
in a URI .
"binary or character strings" -> "sequences of bytes or characters" (in
most contexts (programming languages,...), "character string" is
equivalent to "string", while "binary string" is undefined.)
Superfluous space before period.
If not all legal values or protocol interactions of the
base standard can be represented using the scheme, the definition
SHOULD be clear about which subset are allowed, and why.
"Which subset are" -> "which subset is" (or "which subsets are")
3.5. Context of Use
Most commonly, URIs are used as references to
resources within directories or hypertext documents, as hyperlinks to
other resources.
This sentence is totally unclear. Why do directories turn up here? Are
"resources within directories" and "other resources" meant to be parallel? Are
"references" and "hyperlinks" intended to be parallel? Is "references
to resources within ..." intended to mean "references to resources from
within ..." or "references to (resources within ...)"? Please clarify.
3.6. Internationalization and Character Encoding
When describing schemes in which (some of) the elements of the URI
are actually representations of human-readable text, care should be
taken not to introduce unnecessary variety in the ways in which
characters are encoded into octets and then into URI characters; see
[RFC3987] and Section 2.5 of [RFC3986] for guidelines. If URIs of a
scheme contain any text fields, the scheme definition MUST describe
the ways in which characters are encoded and any compatibility issues
with IRIs of the scheme.
I think it would be extremely helpful to the average URI scheme
designer/describer if this section mentioned the use of UTF-8. The
reference to Section 2.5 of RFC 3986 is good, but the problem with that
section is that it starts out with very general and abstract language,
and one has to read through the whole section to find the relevant (and
extremely clear and appropriate) advice in the last paragraph.
At a minimum, please point the reader to the last paragraph of Section
2.5. Much better would be to include that paragraph verbatim (and to say
so explicitly):
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
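The three examples in the quoted paragraph can be reproduced with a short sketch. It assumes Python's standard urllib.parse.quote, whose always-safe set (letters, digits, and "-", ".", "_", "~" in current Python 3 versions) matches the unreserved set of RFC 3986; the helper name encode_component is hypothetical.

```python
from urllib.parse import quote

def encode_component(text: str) -> str:
    """Encode text as UTF-8, then percent-encode every octet that
    does not correspond to an unreserved character."""
    # safe="" means nothing beyond the unreserved set is left unescaped.
    return quote(text, safe="", encoding="utf-8")

# A, LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A
for ch in ["A", "\u00C0", "\u30A2"]:
    print(encode_component(ch))
```

This prints "A", "%C3%80" and "%E3%82%A2", the same values as in the paragraph from RFC 3986.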
The scheme specification SHOULD be as restrictive as possible
regarding what characters are allowed in the URI, because some
characters can create several different security considerations (see,
for example [RFC4690]).
I'm afraid that many people will read "as restrictive as possible" as
"well, let's just do ASCII only" or some such. I believe and hope that
this wasn't the intent, but I don't think this comes across. One kind of
improvement would be to change "as restrictive as possible" to just
"restrictive". Another is to change "as restrictive as possible" to "as
restrictive as possible without excluding characters outside US-ASCII".
"can create security considerations" sounds weird. The characters may
create security issues or security problems or some such, which may need
to be described in a Security Considerations section.
All percent-encoded variants are automatically included by definition
for any character given in an IRI production. This means that if you
want to restrict the URI percent-encoded forms in some way, you must
restrict the Unicode forms that would lead to them.
I know what you want to say here (I think it's the point originally
brought up by Björn Höhrmann in the IRI WG). But I think it's too
restrictive and can be worded better:
Definitions of URI schemes that include textual data from Unicode
have to define both the actual characters allowed (for IRIs) and
the corresponding percent-encoded forms (for URIs and IRIs). This
can be done in various ways, but in most cases, it is advisable to
define the actual characters allowed in an IRI production, to allow
the 'pct-encoded' definition from Section 2.1 of [RFC3986] at the
same places, and to add prose that limits percent-encoded forms to
those that can be created by converting valid character sequences
to percent-encoding via UTF-8.
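As a rough illustration of the prose restriction in the proposed text, the following sketch checks whether the percent-encoded forms in a component could have come from UTF-8. It uses Python's standard urllib.parse.unquote_to_bytes; the function name escapes_are_valid_utf8 is hypothetical, and the check covers only the UTF-8 condition, not which characters a particular scheme allows.

```python
from urllib.parse import unquote_to_bytes

def escapes_are_valid_utf8(component: str) -> bool:
    """Return True if the component, after percent-decoding,
    is a valid UTF-8 byte sequence."""
    try:
        # Literal non-ASCII characters (as in IRIs) are taken as UTF-8 too.
        unquote_to_bytes(component).decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

print(escapes_are_valid_utf8("%E3%82%A2"))  # complete three-octet sequence
print(escapes_are_valid_utf8("%E3%82"))     # truncated sequence
```

The first call accepts the complete encoding of KATAKANA LETTER A; the second rejects a sequence that no valid character could have produced.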
Regards, Martin.