Re: Gen-ART LC Review of draft-wilde-text-fragment-06

Martin Duerst <duerst@xxxxxxxxxxxxxxx> · Tue, 20 Feb 2007 18:55:03 +0900

Hello Spencer,

Many thanks for your comments.

At 05:57 07/02/20, Spencer Dawkins wrote:
>I have been selected as the General Area Review Team (Gen-ART)
>reviewer for this draft (for background on Gen-ART, please see
>http://www.alvestrand.no/ietf/gen/art/gen-art-FAQ.html).
>
>Please resolve these comments along with any other Last Call comments
>you may receive.
>
>Document: draft-wilde-text-fragment-06
>Reviewer: Spencer Dawkins
>Review Date:  2007-02-19
>IETF LC End Date: 2007-03-14
>IESG Telechat date: (if known)
>
>Summary:
>
>This document is almost ready for publication as a Proposed Standard RFC. Most of my questions below involve MAY/SHOULD/MUST requirements.
>
>Comments:
>
>I also included some (Nit)s, which are not part of the Gen-ART review but may be helpful for editors later in the process.
>
>Thanks,
>
>Spencer
>
>1.1.  What is text/plain?
>
>   The biggest advantage of text/plain MIME entities is their ease of
>   use and their portability among different platforms.  As long as they
>   use popular character encodings (such as US-ASCII or UTF-8), they can
>   be displayed and processed on virtually every computer system.  The
>   only remaining interoperability issue is the representation of line
>   endindings, which is discussed in Section 4.1.
>
>Spencer (Nit): s/endind/end/

Nice catch, thanks! Fixed.

>2.  Fragment Identification Methods
>
>   The identification of fragments of text/plain MIME entities can be
>   based on different foundations.  Since it is not possible to insert
>   explicit, invisible identifiers into a text/plain MIME entity (as for
>   example used in HTML documents, implemented through dedicated
>   attributes), fragment identification has to rely on certain inherent
>   properties of the MIME entity.  This memo specifies fragment
>   identification using six different methods, which are character
>   positions and ranges, line positions and ranges, regular expression
>   matching, and a mechanism for improving the robustness of fragment
>
>Spencer (Nit): I count five methods, plus the mechanism, which doesn't seem to actually identify a fragment.

Yeah, of course. Needs an 'and' before "regular expression matching",
but then it works out much better. Fixed.

>   identifiers (entity hashes).
>
>2.2.1.  Character Position
>
>   To identify a character position (i.e., a fragment of length zero
>   between two characters), the 'char' scheme followed by a single
>   number is used.  Rather than identifying a fragment consisting of a
>
>Spencer (Clarity): at least a couple of times, a description starts out "Rather than X, Y", and I found this confusing. I'd prefer to see "Y, rather than X", if this makes sense to the authors.

Fixed three or four occurrences (all the ones I found). I guess this is a
Germanism; the mother tongue of both authors is (Swiss) German :-).

>   number of characters, this method identifies a position between two
>   characters (or before the first or after the last character).
>   Character position counting starts with 0, so the character position
>   before the first character of a text/plain MIME entity has the
>   character position 0, and a MIME entity containing n distinct
>   characters has n+1 distinct character positions, the last one having
>   the character position n.
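
To illustrate the counting rule quoted above, here is a minimal sketch. It is not part of the draft; the function names are hypothetical, and it assumes that a "character" is a Unicode code point, which is how Python strings count.

```python
def char_positions(text: str) -> range:
    """Valid 'char' positions for text: 0 through len(text), inclusive."""
    return range(len(text) + 1)


def split_at(text: str, position: int) -> tuple[str, str]:
    """Split text at a zero-length position between two characters."""
    if not 0 <= position <= len(text):
        raise ValueError(f"position {position} is outside 0..{len(text)}")
    return text[:position], text[position:]


# A three-character entity has four positions: 0, 1, 2 and 3.
positions = list(char_positions("abc"))
before_first = split_at("abc", 0)   # position before the first character
after_last = split_at("abc", 3)     # position after the last character
```
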
>
>2.5.  Fragment Identifier Robustness
>
>   Hash sums may specify the character encoding that has been used when
>   creating the hash sums, and if such a specification is present,
>   clients MUST check whether the character encoding specified for the
>   hash sum and the character encoding of the retrieved MIME entity are
>   equal, and clients MUST NOT check the hash sum if these values
>   differ.  However, clients MAY choose to transcode the retrieved MIME
>   entity in the case of differing character encodings, and after doing
>   so, check the hash sum.  Please note that this method is inherently
>   unreliable, because certain characters or character sequences may
>   have been lost or normalized due to restrictions in one of the
>   character encodings used.
>
>Spencer: I have a concern about using MAY to allow clients to check reliability in an inherently unreliable way. I would prefer at least SHOULD NOT.

I agree that at first, this looks a bit scary, and in general, is a bad
idea. But I don't think this is a big concern in this case in practice.
The failure cases of this method are highly skewed towards false negatives
(transcoding back to what the charset information in the fragment ID says
doesn't match) as opposed to false positives (a match despite the fact that
the document has actually changed). This should be obvious for MD5 hashes,
and also applies to length 'hashes'. In the length case, there is a basic
risk of false positives independent of character encoding anyway
(the document gets changed, but with the same exact resulting length).
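
A small sketch of both failure modes, under stated assumptions: the helper names are hypothetical, lengths are counted in Unicode code points, and Unicode normalization (NFD) stands in for whatever a lossy transcoding step might do to the text. It is an illustration only, not the draft's algorithm.

```python
import hashlib
import unicodedata


def length_hash(text: str) -> int:
    """The cheap 'hash': the number of characters in the entity."""
    return len(text)


def md5_hash(text: str, charset: str = "utf-8") -> str:
    """MD5 over the entity as encoded in the given character encoding."""
    return hashlib.md5(text.encode(charset)).hexdigest()


original = "caf\u00e9 menu\n"  # precomposed e-acute, 10 characters

# False negative: same document, but normalized along the way, so the
# e-acute becomes two code points and both checks fail.
decomposed = unicodedata.normalize("NFD", original)
false_negative_length = length_hash(decomposed) != length_hash(original)
false_negative_md5 = md5_hash(decomposed) != md5_hash(original)

# False positive (length only): the document really changed, but the
# edit kept the length, so the length check still passes.
edited = "caf\u00e9 mend\n"
false_positive_length = length_hash(edited) == length_hash(original)
md5_detects_edit = md5_hash(edited) != md5_hash(original)
```

The last case does not involve character encodings at all, which is the basic risk mentioned above.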

Do you agree that this can stay as is? Or do you think some wording
change would make it easier to understand that this as such isn't a
big risk?

>3.  Fragment Identification Syntax
>
>   The syntax for the fragment identifiers is straightforward.  The
>   syntax defines four schemes, 'char', 'line', 'match', and hash (which
>   can either be 'length' or 'md5').  The 'char' and 'line' schemes can
>   be used in two different variants, either the position variant (with
>   a single number), or the range variant (with two comma-separated
>   numbers).  The 'match' scheme has a regular expression as its
>   parameter, which must be specified as a string with escaped
>   semicolons (because the semicolon is used to concatenate multiple
>   fragment identification scheme parts).  The hash scheme can either
>   use the 'length' or the 'md5' scheme to specify a hash value.
>
>Spencer: The use of the word "hash" to describe the length of a resource in characters violates the Principle of Least Astonishment. Could "length" and "md5" not be grouped together, just for ease of understanding?

This is a good point. I'm a bit reluctant to make all the changes,
which would be quite extensive, but will try to do so if you insist.
An alternative is to make it much clearer in the text that talking about
length as a 'hash' may be misleading. (We really use it as a hash, but it
is only a very, very weak hash, although an extremely cheap one.)

>   The following syntax definition uses ABNF as defined in RFC 4234 [7],
>   including the rules DIGIT and HEXDIG.
>
>4.3.  Handling of Hash Sums
>
>   Clients are not required to implement the handling of hash sums, so
>   they MAY choose to ignore hash sum information altogether.  However,
>   if they do implement hash sum handling, the following applies:
>
>   If a fragment identifier contains a hash sum, and a client retrieves
>   a MIME entity and detects that the hash sum has changed (observing
>   the character encoding specification as described in Section 3.2, if
>   present), then the client SHOULD NOT interpret any other text/plain
>
>Spencer: why SHOULD NOT, and not MUST NOT?

In many cases (e.g. additions to the end of a file), the fragment id
may still be valid. In other cases (e.g. small edits shifting things
by a character or two), the user still may find the right place.
So going ahead is not always completely useless, and therefore we
wanted to give implementations some leeway to do what seems to
work best in their context (e.g. an interactive application vs.
something like an automatic extractor).

>   fragment identifier scheme part.  A client MAY signal this situation
>   to the user.
>
>4.4.  Syntax Errors in Fragment Identifiers
>
>   If a fragment identifier contains a syntax error (i.e., does not
>   conform to the syntax specified in Section 3), then it MUST be
>   ignored by clients.  Clients SHOULD NOT make any attempt to correct
>
>Spencer: again, why SHOULD NOT, and not MUST NOT?

This seems to be a valid point. We know from the HTML experience
that trying to fix things is a very slippery slope.

>   or guess fragment identifiers.  Syntax errors MAY be reported by
>   clients.
>
>5.  Examples
>
>   The following examples show some usages for the fragment identifiers
>   defined in this memo.
>
>Spencer: this section is very helpful. Thank you for including it.
>
>   ftp://example.com/text.txt#line=10,20;length=9876,UTF-8
>
>   As in the second example, this URI identifies lines 11 to 20 of the
>   text.txt MIME entity.  The additional length hash sum specifies that
>   the MIME entity has a length of 9876 characters when encoded in
>   UTF-8.  If the client supports the length hash sum scheme, it may
>   test the retrieved MIME entity for its length, but only if the
>   retrieved MIME entity uses the UTF-8 encoding or has been locally
>   transcoded into this encoding.  If the length of the retrieved MIME
>   entity does not match the length specified in the fragment
>   identifier, the client SHOULD NOT interpret the line part and MAY
>   signal this to the user.
>
>Spencer: this is the only example description that also includes normative text, which I believe is redundant anyway. I'd remove the last sentence from the description. 

Very valid point. Fixed as proposed.
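
For readers following along, the example URI above can be taken apart roughly as follows. This is a simplified sketch with hypothetical helper names: it assumes there is no 'match' part (so no escaped semicolons to handle), and it relies on Python's splitlines(), which recognizes more line terminators than the draft's Section 4.1 discusses.

```python
from urllib.parse import urlsplit


def parse_fragment(uri: str) -> dict[str, list[str]]:
    """Split a fragment into scheme parts, e.g. {'line': ['10', '20']}."""
    parts = {}
    for part in urlsplit(uri).fragment.split(";"):
        scheme, _, value = part.partition("=")
        parts[scheme] = value.split(",")
    return parts


def select_lines(text: str, start: int, end: int) -> list[str]:
    """Lines between line positions start and end (positions count from 0)."""
    return text.splitlines(keepends=True)[start:end]


uri = "ftp://example.com/text.txt#line=10,20;length=9876,UTF-8"
parts = parse_fragment(uri)

# A made-up 25-line entity; line=10,20 selects lines 11 to 20.
entity = "".join(f"line {n}\n" for n in range(1, 26))
start, end = (int(value) for value in parts["line"])
selected = select_lines(entity, start, end)
```
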

Also, I have added your name to the Acknowledgements section.
Please tell me if you don't want this.

Many thanks again and kind regards,     Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@xxxxxxxxxxxxxxx     
