Re: [PATCHv2] Add details about svn-fe's dumpfile parsing

Andrew Sayers <andrew-git@xxxxxxxxxxxxxxx> · Mon, 16 Apr 2012 22:35:23 +0100

On 16/04/12 21:06, Junio C Hamano wrote:
> Andrew Sayers <andrew-git@xxxxxxxxxxxxxxx> writes:
>>  
>> +Unlike Subversion, 'svn-fe' interprets property key/value pairs as
>> +null-terminated binary strings.  This means it will accept content
>> +that Subversion normally wouldn't produce (such as filenames
>> +containing tab characters) or would refuse to parse (such as usernames
>> +containing Latin-1 characters).  However, like Subversion it will
>> +handle newlines incorrectly in filenames (see BUGS below).
>> +
> 
> Do the first two sentences in the above paragraph claim that it a bug that
> 'svn-fe' does not mimick what Subversion does?  I am not sure what lessons
> the authors of tools, whose output is meant to feed svn-fe, are expected
> to learn here.  For example, is the purpose of the above paragraph to make
> tool authors realize that "NUL terminates key and value, so I have to
> refrain from using a key or a value that contains a NUL byte?" [*1*]  Even
> in that case, it is unclear to me what I (as an author of such a tool that
> reads data from somewhere and format it to plesae svn-fe) could do with
> that knowledge.
> 
> [Footnote]
> 
> *1* By the way, NULL is a pointer that does not point anywhere.  The name
> of a byte whose value is 0x00 is NUL.

The dumpfile documentation says that "... property key/value pairs may
be interpreted as binary data in any encoding by client tools"[1], but
SVN itself interprets the data as UTF-8, so I was surprised to see
svn-fe hadn't aped that behaviour.  You could argue this is a bug if you
want to call `svnadmin` the reference implementation.  You could even
argue that treating NUL characters specially is a bug if you want to
call the documentation the official standard (albeit a bug shared by
`svnadmin`).  Personally I don't have a problem with either decision, so
I've just noted some unobvious behaviour.

Lessons to learn will depend on the author, but here are some I took:

1. UTF-8 is the most common encoding, but not the only one.  If your
tool only allows UTF-8 input and only produces only UTF-8 output then
you are the limiting factor in your toolchain.  In my case I think I'll
just live with that, but I would like to have known before I started.

2. `svn` itself isn't universally considered the reference
implementation, only a popular one.  When deciding the correct behaviour
(or the range of possible behaviours), it's not enough just to look at
what `svn` does.

3. `svn-fe` doesn't slavishly follow either the documentation or `svn`.
 Beyond a certain point you have to actually check assumptions against
svn-fe's behaviour (then document what you find so the next guy doesn't
have to ;).  Again, I think this is pragmatic but I would like to have
known earlier.

I've tried not to labour the above points, because it would be easy to
overstate them and because other authors will take different lessons.
It's certainly possible that some author would come to svn-fe wanting to
do something crazy like encode newlines as NULs and come away realising
C doesn't like that.  A more concrete example is that a GSoC student who
turns up next year to work on writing commits back to SVN will need to
know svn-fe chose not to care about UTF-8 and that there's a nest of
edge cases waiting for them.

	- Andrew

[1]http://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html