On Sat, 17 Mar 2007, Junio C Hamano wrote:
>
> When unpacking a depth-3 deltified object A, the code finds the
> target object A (which is a delta), asks for its base B, and puts B
> in the cache after using it to reconstitute A. While doing so,
> the first-generation base B is also a delta, so its base C (which
> is a non-delta) is found and placed in the cache. When A is
> returned, the cache has B and C. If you ask for B at this
> point, we read the delta, pick up its base C from the cache,
> apply, and return while putting C back in the cache. If you ask
> for A after that, we do not read from the cache, although it is
> available.

Yes. I debated that a bit with myself, but decided that:

 (a) it probably doesn't really matter a lot (but I don't have the
     numbers)

 (b) trying to *also* fill non-delta-base queries from the delta-base
     cache actually complicates things a lot. Surprisingly much so
     (the current logic of removing the entry from the cache only to
     re-insert it after it has been used makes the memory management
     totally trivial, as you noticed)

 (c) regardless, we could decide to do a more extensive caching layer
     later if we really wanted to, and at that point it probably makes
     more sense to integrate it with the delta-base cache. Most git
     objects are use-once, which is why we really *just* save the flag
     bits and the SHA1 hash name itself in "struct object", but doing a
     generic caching layer for object content would likely obviate the
     need for the current "save_commit_buffer" logic.

That (c) in particular was what made me think that it's better to keep
it simple and obvious for now, since even the simple thing largely
fixes the performance issue. Almost three seconds I felt bad about;
just over a second for something as complex as "git log drivers/usb/"
I just cannot make myself worry about.

> In any case, your code makes deeply deltified packfiles a lot
> more practical. As long as the working set of delta chains fits
> in the cache, after unpacking the longest delta, the objects on
> the chain can be had with one lookup and one delta application.

Yeah. I think it would be good (separately, and as "further tweaks") to:

 - have somebody actually look at hit rates for different repositories
   and hash sizes.

 - possibly allow people to set the hash size as a config option, if it
   turns out that certain repository layouts or usage scenarios end up
   preferring bigger caches.

   For example, it may be that for historical archives you might want
   to have deeper delta chains to make the repository smaller, and if
   they are big anyway, maybe they would prefer a larger-than-normal
   cache as a result. On the other hand, if you are memory-constrained,
   maybe you'd prefer to re-generate the objects and waste a bit of CPU
   rather than cache the results.

But neither of the above is really an argument against the patch, just
a "there's certainly room for more work here if anybody cares".

> Very good job.

I'm pretty happy with the results myself. Partly because the patches
just ended up looking so *nice*.

		Linus
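The memory-management trick mentioned in point (b) above can be shown in a
toy form. The sketch below is not the actual code from the patch; the names
(cache_lookup_and_detach, cache_insert, CACHE_SLOTS) and the fixed 256-slot
table are made up for illustration. It only demonstrates the ownership rule
being described: a cache hit detaches the buffer and hands it to the caller,
and the caller re-inserts it after applying its delta, so the cache never
shares a buffer with anyone and needs no reference counting or LRU lists.

/*
 * delta_cache_sketch.c: illustrative only, not the git implementation.
 * Compile with: cc delta_cache_sketch.c -o delta_cache_sketch
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 256                 /* small fixed-size hash table */

struct cache_entry {
        unsigned char sha1[20];         /* key: object name */
        void *data;                     /* inflated, delta-applied object */
        unsigned long size;
        int used;
};

static struct cache_entry cache[CACHE_SLOTS];

static unsigned int slot_of(const unsigned char *sha1)
{
        /* SHA-1 bytes are already uniformly distributed */
        return sha1[0] % CACHE_SLOTS;
}

/*
 * On a hit, the entry is removed from the cache and ownership of the
 * buffer passes to the caller; the cache keeps no reference to it.
 */
static void *cache_lookup_and_detach(const unsigned char *sha1,
                                     unsigned long *size)
{
        struct cache_entry *e = &cache[slot_of(sha1)];

        if (!e->used || memcmp(e->sha1, sha1, 20))
                return NULL;            /* miss: caller inflates from the pack */
        e->used = 0;
        *size = e->size;
        return e->data;                 /* caller now owns this buffer */
}

/*
 * After using a base object (typically to apply a delta on top of it),
 * the caller hands the buffer back. Whatever previously occupied the
 * slot is simply freed, because nobody else can be holding it.
 */
static void cache_insert(const unsigned char *sha1, void *data,
                         unsigned long size)
{
        struct cache_entry *e = &cache[slot_of(sha1)];

        if (e->used)
                free(e->data);
        memcpy(e->sha1, sha1, 20);
        e->data = data;
        e->size = size;
        e->used = 1;
}

int main(void)
{
        unsigned char base_sha1[20] = { 0xc0, 0xff, 0xee }; /* fake object name */
        unsigned long size;
        char *base;

        /* pretend we just inflated base object "C" while resolving a chain */
        cache_insert(base_sha1, strdup("contents of base C"), 19);

        /* later, resolving object "B": pick its base up from the cache */
        base = cache_lookup_and_detach(base_sha1, &size);
        if (base) {
                printf("cache hit, %lu bytes: %s\n", size, base);
                /* ... apply the delta here, then give the base back */
                cache_insert(base_sha1, base, size);
        }
        return 0;
}

Because at most one party ever owns a given buffer, eviction is just a
free() on whatever happened to sit in the slot, which is the "totally
trivial" memory management the message refers to.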