Re: Index format v5

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Sat, 19 May 2012 07:40:56 +0200

On 05/16/2012 11:54 PM, Thomas Gummerer wrote:
On 05/16, Michael Haggerty wrote:
I just reviewed version 1369bd855b86 of your script, and it is MUCH
better.  It's easy to read and review.  The functions that it
defines are now self-contained and could therefore be reused for
other purposes.  There are fewer magic numbers (though there are
still a few; I wonder if there is a way to get rid of those?)
You've done a nice job polishing up the code.

Thanks for the feedback! I could get rid of the magic numbers for
the crc code, but I'm not sure if it makes sense to replace the
others with constants, since they only occur once in the file. I
added comments instead explaining where those numbers come from
instead.

I think it is possible to remove the last magic number and also to make 
the CRC handling easier.  I have pushed some suggested changes to github 
[1]:

1. With the current code, trying to read a file that is less than 24 
bytes long would result in a struct.error (because it would try to 
unpack a string that is shorter than the struct) whereas the underlying 
error in this case should almost always be reported as a signature 
error.  So it is correct to read the signature separately from the rest 
of the header, but it is even more correct to check the signature 
(including its length) before reading on.

2. I introduce a class CRC to hold checksums, so that (a) the code for 
handling checksums can be encapsulated, and (b) an instance of this 
class can be passed into functions and mutated in-place, which is less 
cumbersome than requiring functions to return (data, crc) tuples.  This, 
in turn, makes possible...

3. ...a new function read_struct(f, s, crc), which reads the data for a 
struct.Struct from f, checksums it, and returns the unpacked data.  This 
function is more convenient to use than the old read_calc_crc().

4. The checksum instance can also be made responsible for verifying that 
the next four bytes in the file agree with the expected checksum.  This 
removes some more code duplication.  (See CRC.matches().)

5. You read the extension offsets using CRC_STRUCT.size, which is 
technically correct but misleading.  In fact, the extension offsets 
should be documented using their own EXTENSION_INDEX_STRUCT.  Also, it 
makes more sense to store the unpacked integer offsets to extoffsets 
rather than the raw 4-byte strings.

6. With a couple of more minor changes it is possible to replace the 
magic number "24" in read_index_entries().  With this change the 
computation is documented very explicitly and is also (somewhat) robust 
against future changes in the format.

Look over my changes and take whatever you want.

Thanks, I have changed those. I added the docstrings for all read
functions, they however don't seem to make sense for the print
functions, since you're probably faster just reading the code for
them.

That's fine.  If you document nontrivial functions you are already doing 
much better than the git project average.

One since I changed in addition is to those changes, is that I
gave the exceptions names to make them more meaningful.

Good.

Michael

[1] https://github.com/mhagger/git/tree/pythonprototype

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html