Re: cleaner/better zlib sources?

Nicolas Pitre <nico@xxxxxxx> · Fri, 16 Mar 2007 13:06:44 -0400 (EDT)

On Fri, 16 Mar 2007, Linus Torvalds wrote:

> The most performance-critical objects for uncompression are commits and 
> trees. At least for the kernel, the average size of a tree object is 678
> bytes. And that's ignoring the fact that most of them are then deltified, 
> so about 80% of them are likely just a ~60-byte delta.

This is why in pack v4 there will be an alternate tree object 
representation which is not deflated at all.

In short we intend to have 3 tables where common things are factored 
out:

 1) the path component string table (deflated)

 2) author/committer string table (deflated)

 3) sorted SHA1 table (obviously not deflated)

The sorted SHA1 table will be part of the pack instead of being in the 
pack index.  The idea is that most SHA1's are already duplicated in the 
pack already anyway within commit and tree objects.  With a single table 
then commit and tree objects can index into that SHA1 table rather than 
providing the SHA1 value inline for the objects they refer to.

This means that a tree object record would be only 6 bytes according to 
the current design: 2 bytes to index into the path component string 
table (which also include the mode information), and 4 bytes to index 
into the sorted SHA1 table.  And similarly for commit objects.

This means that the pack index will only have a table of offsets 
corresponding to the table of sorted SHA1's.

So... walking revisions will become only a matter of picking the first 
commit object, using the tree index value (which is not deflated), but 
instead of using it in the SHA1 table it could be used in the offset 
table to find the location of the corresponding tree object directly.  
Same goes for tree entries, or for locating the parent's commit object.

No deflating, no binary searching, no SHA1 comparisons.  Plain straight 
pointer dereference.

Then, if you want to filter tree walking on path spec, you only need to 
locate the path component in the path table once and use the 
corresponding index to filter tree entries instead of repeated strcmp().  
Same thing if you want to filter commits based on author/committer.  
One side effect of this is that you can tell straight away that a path 
doesn't exist in the whole pack if one of its components cannot be found 
in the table (that works only if no legacy tree representations are 
present of course).  That should make history walking blazingly fast.

The only thing that gets deflated is the commit message which needs to 
be inflated only when displaying it.

And so far that makes for quite smaller packs too!

Nicolas
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html