Re: [RFC] Add --create-cache to repack

Nicolas Pitre <nico@xxxxxxxxxxx> · Fri, 28 Jan 2011 13:46:33 -0500 (EST)

On Fri, 28 Jan 2011, Shawn Pearce wrote:

> This started because I was looking for a way to speed up clones coming
> from a JGit server.  Cloning the linux-2.6 repository is painful: it
> takes a long time to enumerate the 1.8 million objects.  So I tried
> adding a cached list of objects reachable from a given commit, which
> speeds up the enumeration phase, but JGit still needs to allocate all
> of the working set to track those objects, then go find them in packs
> and slice out each compressed form and reformat the headers on the
> wire.  It's a lot of redundant work when your kernel repository has
> 360MB of data that you know a client needs if they have asked for your
> master branch with no "have" set.
> 
> Later I realized, we can get rid of that cached list of objects and
> just use the pack itself.  It's far cleaner, as there is no redundant
> cache.  But either way (object list or pack) it's a bit of a challenge
> to automatically identify the right starting points to use.  Linus
> Torvalds' linux-2.6 repository is the perfect case for the RFC I
> posted: it's one branch with all of the history, and it never rewinds.
> But maybe Linus is just very unique in this world.  :-)

Playing my old record again... I know.  But pack v4 should solve a big 
part of this enumeration cost.

I've changed the format slightly again in my WIP branch.  The idea is to:

1) have an uncompressed yet still really dense representation for tree 
   objects;

2) do the same thing for the first part of commit objects, and only 
   deflate the free form text part.

There is nothing new here.  However, it should be possible to:

3) replace all SHA1 references with an offset into the pack file directly, 
   just like we do for OFS_DELTA objects.  If the SHA1 is actually 
   needed then we can obtain it with a reverse lookup of the object's 
   offset in the pack index file, but in practice that is not required 
   that often.
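
The reverse lookup in point 3 can be sketched as follows. This is only an illustration under assumptions: the pack index is modelled as a list of (SHA1, offset) pairs, and `build_reverse_index` and `sha1_at` are hypothetical helper names, not functions from Git or JGit.

```python
import bisect

def build_reverse_index(index_entries):
    """Sort (sha1, offset) pairs by offset so offsets can be binary-searched.

    index_entries is a stand-in for what a pack index lists; the real
    on-disk index format is not modelled here.
    """
    by_offset = sorted((offset, sha1) for sha1, offset in index_entries)
    offsets = [offset for offset, _sha1 in by_offset]
    names = [sha1 for _offset, sha1 in by_offset]
    return offsets, names

def sha1_at(offsets, names, offset):
    """Return the SHA1 of the object that starts exactly at `offset`."""
    i = bisect.bisect_left(offsets, offset)
    if i == len(offsets) or offsets[i] != offset:
        raise KeyError(offset)
    return names[i]
```

The lookup costs one binary search, and it is only paid when a caller really needs the object name rather than its location.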

So walking the history graph and enumerating objects would require 
nothing more than simply following straight pointers in the pack data in 
99% of the cases.  No object decompression, no memory buffer 
allocation/deallocation to perform that decompression, no string parsing 
in the tree object case, etc. Only cross pack references would require a 
full SHA1 based lookup like we do now.
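
A toy model of that walk, assuming every reference stays inside one pack. The `Commit`, `Tree` and `Blob` records and the `pack` dictionary are hypothetical in-memory stand-ins; the actual pack v4 encoding is not described in this message.

```python
from collections import namedtuple

# Hypothetical records: references are pack offsets, not SHA1s.
Commit = namedtuple("Commit", "tree parents")
Tree = namedtuple("Tree", "entries")      # list of (name, offset)
Blob = namedtuple("Blob", "size")

def enumerate_objects(pack, start):
    """Return the offsets of all objects reachable from the commit at `start`.

    `pack` maps offset -> record. No hashing, inflating or text parsing
    happens here; each step just follows an offset.
    """
    seen = set()
    order = []
    stack = [start]
    while stack:
        offset = stack.pop()
        if offset in seen:
            continue
        seen.add(offset)
        order.append(offset)
        obj = pack[offset]
        if isinstance(obj, Commit):
            stack.append(obj.tree)
            stack.extend(obj.parents)
        elif isinstance(obj, Tree):
            stack.extend(child for _name, child in obj.entries)
    return order

# Two commits; the second adds main.c and reuses the README blob.
pack = {
    0: Blob(12),
    20: Tree([("README", 0)]),
    40: Commit(tree=20, parents=[]),
    60: Blob(30),
    80: Tree([("README", 0), ("main.c", 60)]),
    100: Commit(tree=80, parents=[40]),
}
reachable = enumerate_objects(pack, 100)
```

A cross-pack reference would be the one case where this loop has to fall back to a SHA1 lookup instead of indexing `pack` directly.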

I still have to sit down and figure out the implications of this, 
especially with forward references, meaning that the offset might have 
to be an object index so as to allow for variable length encoding, and also 
to make sure index-pack can reconstruct the pack index.  But that would 
only be an indirect lookup which shouldn't be significantly costly.

So that's the idea.  Keep the exact same functionality as we have now, 
without any need for cache management, but making the data structure in 
a form that should improve object enumeration substantially.

Nicolas