Re: upload-pack packfile caching

On Tue, 16 Sep 2008, Scott Chacon wrote:

> I was wondering if it would be of general interest to have upload-pack
> take an option to cache packfiles.  Unless I am mistaken, every clone
> on a git server will recreate the same packfile until something new is
> pushed into it, correct?  I thought it might be a good idea to pass an
> option to have it cache the packfile that is created if
> create_full_pack is set and re-use it until the repository is updated.
>  If I patched upload-pack to do this, would there be any interest in
> it?

Well, if you do that there are a few things to be careful about.

First, having a server process able to write files is a security hazard.
If you want a pack cache, it is best created manually by the repository
owner.  You don't want someone who merely clones a repository to be able
to mess with such a cache.

Secondly, the dynamic creation of a pack currently takes into account
the capabilities of the client so as not to produce a pack with features
that the client does not support.  So, in order not to have to cache
packs for many feature combinations, this cache should probably take
effect only when the pack capabilities used on the server are also
supported by the client.  For instance, a cached pack using delta base
offsets can only be sent to clients that advertise the ofs-delta
capability.

Now, the _only_ advantage of a cached pack file is in avoiding the
execution of rev-list.  Otherwise the creation of a pack for streaming
is almost identical to a straight copy of data from disk, thanks to
pack data reuse.  The rev-list phase can be made faster by having the
pack-objects process do the object listing itself instead of having the
output from rev-list piped into it ('git repack' does that but
'git-upload-pack' doesn't).  And I believe that rev-list could be made
much, much faster with pack v4.
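
As a rough illustration of that difference, here are the equivalent
plumbing commands (just a sketch for comparison purposes, not the
actual upload-pack code path):

   # external object listing: rev-list output piped into pack-objects
   git rev-list --objects --all | git pack-objects --stdout >/dev/null

   # internal object listing, as 'git repack' does it
   git pack-objects --all --stdout </dev/null >/dev/null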

That being said...

What you could have is a simple file with 2 SHA1s: the first 
corresponding to the output of 'git for-each-ref' and the second one 
corresponding to the list of all objects reachable from those refs.

For example:

1) git for-each-ref --format="%(objectname)" --sort=objectname | sha1sum

2) git for-each-ref --format="%(objectname)" | \
   xargs git rev-list --objects | cut -c -40 | sort | sha1sum

So, if you do the above in a freshly cloned repository, you'll find that 
the SHA1 in 2) corresponds to this:

3) git show-index < .git/objects/pack/pack-*.idx | cut -f2 -d' ' | sha1sum

which means that all objects reachable from all refs are found in the 
only pack you have.

Now, if the SHA1 in 2) is computed over the binary representation of all 
those object names, you'll find out that it corresponds to the actual 
pack name in the .git/objects/pack/ directory.
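
In shell that would be something like this (just a sketch, assuming
xxd is available to turn the hex object names back into binary before
hashing):

4) git for-each-ref --format="%(objectname)" | \
   xargs git rev-list --objects | cut -c -40 | sort | \
   xxd -r -p | sha1sum

The result should then match the <SHA1> part of the single
.git/objects/pack/pack-<SHA1>.pack file in that freshly cloned
repository.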

So what upload-pack could do is look for a special file containing
those 2 SHA1s.  If that file exists, check whether the first SHA1
matches the one computed over the current ref values; if it does, then
the name of the pack to send out is given by the second SHA1.  If that
pack is found in the repository then you just have to stream it out.
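
In shell terms the check could look roughly like this (only a sketch;
the .git/pack-cache file name and its one-line format are made up for
illustration):

   # hypothetical cache file: "<refs SHA1> <pack SHA1>" on one line
   read cached_refs cached_pack < .git/pack-cache || exit 0
   refs=$(git for-each-ref --format="%(objectname)" --sort=objectname | \
          sha1sum | cut -d' ' -f1)
   pack=.git/objects/pack/pack-$cached_pack.pack
   if test "$refs" = "$cached_refs" && test -f "$pack"
   then
           # stream the cached pack as-is (the real code would of
           # course send it over the pack protocol)
           cat "$pack"
   fi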

Creating that file is then just a matter of doing the equivalent of the
above commands and repacking your repository into a single pack.
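
For example (again only a sketch, using the same made-up
.git/pack-cache file as above):

   git repack -a -d
   refs=$(git for-each-ref --format="%(objectname)" --sort=objectname | \
          sha1sum | cut -d' ' -f1)
   pack=$(git for-each-ref --format="%(objectname)" | \
          xargs git rev-list --objects | cut -c -40 | sort | \
          xxd -r -p | sha1sum | cut -d' ' -f1)
   echo "$refs $pack" > .git/pack-cache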


Nicolas
