On Sun, 7 Jun 2009, Sam Vilain wrote:

> On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > Please don't mess up the layers of abstractions.  Creating a list
> > of objects is currently the biggest bottleneck for big clones.  It
> > is not the object compression (delta) phase, it is not the pack
> > creation either.
> [...]
> > In other words, you should really concentrate on making 'git
> > rev-list --objects' and 'git rev-list --objects-edge' instantaneous
> > without having to parse any commit nor tree objects (at least for
> > the part already covered by the cache).  And as new objects are
> > added through 'git add' and 'git commit' or 'git fetch', the
> > traditional object parsing would have to take place only up to the
> > point where the cache could take over, and update the cache for the
> > non-covered objects while at it.  I think this is quite a difficult
> > problem already.
>
> It's a good thing, then, that this is exactly what the first
> milestones in the project are covering.
>
> What I don't understand though is that during initial clone the
> object compression phase does take a lot of time, and that there
> have been reports that large initial clones use a lot of VM for the
> packfile to be sent.

Did you reproduce those results yourself?  Did you also try out the
simple bench test I suggested while responding to Jakub?

For example, let's take the Linux kernel repository as I have it handy
here.  For the various phases I get:

  Counting objects    = 25 seconds
  Compressing objects =  2 seconds
  Writing objects     =  3 seconds

A long compression phase is a sign of a really badly packed repository.
The solution to that is simply to repack.  After that, subsequent
clones as well as further repacks will have a much shorter compression
phase, as the work performed in the first repack will be directly
reusable.
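If you want to see that breakdown on your own repository, something
along these lines should show it (a rough sketch: --all here merely
stands in for whatever set of refs a real clone would actually ask
for):

  # object enumeration only -- this is the "counting" phase
  $ time git rev-list --objects --all > /dev/null

  # full pack generation, thrown away; the compression and writing
  # phases are reported as progress on stderr
  $ git rev-list --objects --all | git pack-objects --stdout > /dev/null

  # if the compression phase is long, repack once so the delta work
  # becomes reusable (add -f --window=... --depth=... for a more
  # aggressive delta search)
  $ git repack -a -d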
> So what I'm doing is trying to answer the question about why we
> can't just start streaming the pack as soon as we know how many
> objects will be in it.  All those other stages can happen in
> parallel - for instance I would wager that it's far more efficient
> to detect loops as we're streaming, as the packfile is accessed,
> rather than having to seek and read each object header in the
> packfile up front.

My stance on this is that it is fundamentally impossible to do
efficiently, if at all.

What we currently do is not loop detection but rather loop avoidance.
And because your pack set is not a static thing (new packs are created
or fetched, and sometimes they're repacked into a single pack for a
while), you simply can't have a stable cache of object topology based
on how objects are stored in packs.  This is simply not immutable
data.  Only object names and contents are immutable, nothing more.

What I'm telling you is: in order to know how many objects the pack
which is about to be created will contain, you need to count them.
This is what currently takes 25 seconds in the above example.  Then
comes a 2-second compression phase, which could be even shorter if
your repository were better packed than mine currently is.  At that
point the writing phase starts and the pack header is streamed.  So
trying to stream the pack as soon as you know how many objects it
contains will save you a big whoopee 2 seconds.  This is nothing to go
crazy about, really.

Yet, to be able to stream a pack right away, you would still need to
know _where_ to pick the pack data from.  And in most cases this is
not from a nice single pack but rather from multiple ones.  And some
of those packs have objects which are duplicated in other packs.  And
yet, by mixing pack data from multiple source packs, you gain new
opportunities for delta compression, because you're now putting
together objects which were previously kept apart and therefore
couldn't have deltas created between them (that's the 2-second phase
above).

You still must be careful during that phase not to create delta loops,
and to do that you must take into account all the other deltas from
existing source packs that you are not going to recompute but just
copy straight into the new pack.  And of course there is the delta
depth limit to enforce.  This means that some rearrangements of deltas
force some objects which were previously deltas to become undeltified,
or to become new deltas against a shallower base object.  And the only
way to know what form each object will take in this context is, again,
by going through the full compression phase, which works on a list of
objects whose ordering is totally different from the one used to
actually store objects in the pack.  In other words, you cannot work
out delta issues as you stream the pack, because you don't want to
write objects into the pack in the same order used for delta
processing.

Why a different object ordering for storing in the pack?  Because we
want to create the best data layout possible in the new pack so that
future runtime access to that pack will be optimal.  This often means
that objects are picked from existing packs not in a linear fashion
but rather in a seemingly random way.  This is because we want the
most recent commits to get the best IO patterns in a pack.  However,
when the repository is updated through fetches or pushes, the newly
obtained packs usually carry even more recent commits for which the
bulk of referenced objects are still to be found in older packs (this
is why a fetch is so quick).  So the more fetches or pushes are
performed, the less efficient your repository becomes with regard to
the most recent commits.  This is why a subsequent repack will
reshuffle all objects so that the most recent commits get a linear IO
pattern again, by picking objects from the most recent packs as well
as from older packs and putting them all contiguously in the new pack.
But again, you cannot know whether those objects will be in delta form
or not, or against which base object, before the compression phase is
over.
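If you want to see what that layout looks like in an existing pack,
'git verify-pack -v' gives a decent view (the pack name below is just
a placeholder for whatever sits in your .git/objects/pack/ directory):

  $ git verify-pack -v .git/objects/pack/pack-<sha1>.idx | less

Each line shows an object's name, type, size, size in the pack and
offset, plus, for deltified objects, the delta depth and the base
object; the summary at the end tells you how many objects sit at each
delta chain length.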
I hope this gives you a better idea of the answer to the question
about why we can't just start streaming the pack as soon as we know
how many objects will be in it, and why all those other stages cannot
simply happen in parallel.

> In summary, I'm trying to make sure that the revision cache contains
> whatever information might help with making it start as soon as
> possible, where putting the information in is trivial.
>
> If pack-objects integration is as small a win as you say - and I
> have no hard facts to dispute this - then when we come to that part
> and investigate I'm sure with the facts in hand we will agree on
> what approach to take.

Good, I'm glad you see things that way.  Because, to me, even just the
notion of a good revision cache is not that simple.

> > What pack-objects would greatly benefit from is a facility that
> > could provide a list of objects with their SHA1, type, size
>
> Which size? compressed, uncompressed?  What does that win us?  Again
> I'm trying to support all information with a use case.

Uncompressed.  That information is used by the delta heuristics, and
the code currently does its best to avoid seeking all over through
delta chains to fetch that tiny bit of information whenever that can
be avoided (have a look at check_object() in builtin-pack-objects.c).

> > and the 32-bit object path hash
>
> I'm not sure what this means, can you refer me to some docs or
> relevant source?

If you don't know what that is, then I'm afraid you might be lacking
some background on the packing and delta strategy.  I'd suggest you
read Documentation/technical/pack-heuristics.txt first, and then find
out from the commit logs for builtin-pack-objects.c (or even within
the file itself) what has changed since then.


Nicolas