Re: [PATCH 2/2] Implement a simple delta_base cache

Junio C Hamano <junkio@xxxxxxx> · Sat, 17 Mar 2007 15:37:30 -0700

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> ie, we still re-generate some of the objects multiple times, but now, 
> rather than generating them (on average) 20+ times each, we now generate 
> them an average of just 1.3 times each. Which explains why the wall-time 
> goes down by over a factor of two.

This is beautiful.  You only cache what we were about to discard
anyway, and when giving a cached one out, you invalidate the
cached entry, so there is no way the patch can introduce leaks
nor double-frees and it is absolutely safe (as long as we can
pin the packed_git structure, which I think is the case --- even
when we re-read the packs, I do not think we discard old ones).

I've thought about possible ways to improve on it, but came up
almost empty.

When unpacking a depth-3 deltified object A, the code finds the
target object A (which is a delta), ask for its base B and put B
in the cache after using it to reconstitute A.  While doing so,
the first-generation base B is also a delta so its base C (which
is a non-delta) is found and placed in the cache.  When A is
returned, the cache has B and C.  If you ask for B at this
point, we read the delta, pick up its base C from the cache,
apply, and return while putting C back in the cache.  If you ask
for A after that, we do not read from the cache, although it is
available.

Which feels a bit wasteful at first sight, and we *could* make
read_packed_sha1() also steal from the cache, but after thinking
about it a bit, I am not sure if it is worth it.  The contract
between read_packed_sha1() and read_sha1_file() and its callers
is that the returned data belongs to the caller and it is a
responsibility for the caller to free the buffer, and also the
caller is free to modify it, so stealing from the cache from
that codepath means an extra allocation and memcpy.  If the
object stolen from the cache is of sufficient depth, it might be
worth it, but to decide it we somehow need to compute and store
which delta depth the cached one is at.

In any way, your code makes a deeply delitified packfiles a lot
more practical.  As long as the working set of delta chains fits
in the cache, after unpacking the longuest delta, the objects on
the chain can be had by one lookup and one delta application.

Very good job.

> In general, this all seems very cool. The patches are simple enough that I 
> think this is very safe to merge indeed: the only question I have is that 
> somebody should verify that the "struct packed_git *p" is stable over the 
> whole lifetime of a process - so that we can use it as a hash key without 
> having to invalidate hashes if we unmap a pack (I *think* we just unmap 
> the virtual mapping, and "struct packed_git *" stays valid, but Junio 
> should ack that for me).

Ack ;-)

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html