Re: pack v4 status

Nicolas Pitre <nico@xxxxxxx> · Tue, 27 Feb 2007 17:32:00 -0500 (EST)

On Tue, 27 Feb 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 27 Feb 2007, Shawn O. Pearce wrote:
> > 
> > We have thus far reformatted OBJ_TREEs with a new dictionary based
> > compression scheme.  In this scheme we pool the filenames and modes
> > that appear within trees into a single table within the packfile.
> > All trees are then converted to use a 22 byte record format:
> > 
> >   - 2 byte network byte order index into the string pool
> >   - 20 byte SHA-1
> 
> Umm. Am I missing something, or is this totally braindamaged?
> 
> Are you really expecting there to never be more than 64k basenames? Trust 
> me, that's a totally broken assumption. Anything that tracks generated 
> stuff will _easily_ have several tens of thousands of random filenames 
> even in a single tree, much less over the whole history of the repository.

The idea is to use the new encoding only for tree objects whose entries 
all fall within the 64K most frequently used base names, and to fall 
back to the current tree object encoding for any tree that can't be 
represented that way.
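
A rough Python sketch of that fallback, for illustration only. This is
not the pack v4 code. The names build_pool and encode_tree, and the
choice of keying the pool on a (mode, name) pair, are assumptions made
for this example. Only the record layout (2-byte network byte order
index followed by a 20-byte SHA-1) comes from the description quoted
above.

```python
import struct
from collections import Counter

MAX_POOL = 1 << 16  # a 2-byte index can address at most 65536 pool entries


def build_pool(trees, limit=MAX_POOL):
    """Map the `limit` most frequent (mode, name) pairs to indices.

    `trees` is an iterable of trees; each tree is a list of
    (mode, name, sha1) tuples. Hypothetical layout for this sketch.
    """
    counts = Counter((mode, name) for tree in trees for mode, name, _sha1 in tree)
    return {key: i for i, (key, _n) in enumerate(counts.most_common(limit))}


def encode_tree(tree, pool):
    """Return the tree as packed 22-byte records, or None if any entry
    is missing from the pool, meaning the caller should keep the
    current tree encoding for this object."""
    out = bytearray()
    for mode, name, sha1 in tree:
        idx = pool.get((mode, name))
        if idx is None:
            return None  # not representable: fall back
        out += struct.pack(">H", idx) + sha1  # 2-byte big-endian index + SHA-1
    return bytes(out)
```

A tree for which encode_tree returns None would simply be stored the
way it is today, so a repository with many rare names loses nothing
except the size benefit on those particular trees.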

For reference, the GIT tree itself has 585 unique names.

The Linux kernel has 12263 of them.

If we eventually find that it is common, and performance critical, to 
need more bits for those indices, because the number of unique path 
components far exceeds that limit and their usage is evenly distributed, 
then we might just add another tree encoding with a 3-byte index for 
those cases.
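
As a sketch of what the wider variant could look like (hypothetical,
nothing like this exists yet): the helper names index_width and
pack_record are made up for this example, and the 3-byte index is
assumed to be big-endian like the 2-byte one. The record would grow
from 22 to 23 bytes.

```python
def index_width(pool_size):
    """Smallest index width in bytes (2 or 3) that can address the pool."""
    if pool_size <= 1 << 16:
        return 2
    if pool_size <= 1 << 24:
        return 3
    raise ValueError("pool too large for a 3-byte index")


def pack_record(index, sha1, width):
    """Pack one tree entry: `width`-byte big-endian index, then the SHA-1."""
    if len(sha1) != 20:
        raise ValueError("SHA-1 must be 20 bytes")
    # int.to_bytes raises OverflowError if index does not fit in `width` bytes
    return index.to_bytes(width, "big") + sha1
```

Both counts quoted above (585 and 12263) fit comfortably in the 2-byte
form, so index_width would only pick 3 for a pool with more than 65536
entries.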

In the end everything translates back to the same object.

Nicolas