Re: [PATCH] diff-delta: produce optimal pack data

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> On Fri, Feb 24, 2006 at 03:02:07PM -0500, Nicolas Pitre wrote:
> > Well that is probably a bit tight.  Ideally it should be linear with the 
> > size of the data set to process.  If you have 10 files 10MB each it 
> > should take about the same time to pack as 10000 files of 10KB each.  
> > Of course incrementally packing one additional 10MB file might take more 
> > than a second although it is only one file.
> 
> Well, I might not have been fair here.  I tried an experiment where I
> packed each of the twelve large blob objects explicitly one-by-one using
> git-pack-objects.  Incrementally packing each single object was very
> fast.  Well under a second per object on my machine.
> 
> After the twelve large objects were packed into individual packs the
> rest of the packing went very quickly and git v1.2.3's delta reuse worked
> very well.  This was sort of my attempt at simulating how things would
> be if git avoided deltification of each of these large files.  I'm sorry
> to have been so harsh earlier; I just didn't understand that
> incrementally packing one-by-one was going to help this much.
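
(For the archives, that one-pack-per-object step can be done directly 
with git-pack-objects; the big-blob-ids file below, holding the twelve 
blob SHA1s, is a hypothetical name:)

	# pack each big blob into its own pack under .git/objects/pack/
	for blob in $(cat big-blob-ids); do
		echo $blob | git-pack-objects .git/objects/pack/pack
	done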

Hmmmmmmm....

I don't think I understand what is going on here.

You say that, if you add those big files and incrementally repack after 
each commit using git repack with no options, then it takes only about 
one second each time.  Right?
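
In other words, something along these lines after each commit:

	git repack          # with no options: packs only the new loose objects
	git prune-packed    # optional: drop loose objects now in a pack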

But if you use "git-repack -a -f" then it runs for more than an hour?

I'd expect something like 2 * (sum of i for i = 1 to 10), i.e. something 
in the 110 second range, due to the combinatorial effect of repacking 
everything.  This is far from one hour, and something appears to be 
really, really wrong.
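
Spelled out (just expanding the arithmetic above):

	2 \sum_{i=1}^{10} i = 2 \cdot \frac{10 \cdot 11}{2} = 110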

How many files besides those 12 big blobs do you have?

> This gives me hope that if somehow git were to not attempt to deltify
> these objects then performance would be much better than acceptable.
> 
> [snip]
> > However, if you could let me play with two samples of your big file I'd 
> > be grateful.  If so, I'd like to make git work well with your data set 
> > too, which is not that uncommon after all.
> 
> I would be happy to do this.  I will probably need to scrub a bit and
> make sure that the result shows the same characteristics.  How would you
> like me to deliver these files to you?  They are about 25 MB deflated.

If you can add them into a single .tgz with instructions on how 
to reproduce the issue and provide me with a URL where I can fetch it, 
that'd be perfect.
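
Something like this would do (file names hypothetical):

	tar czf big-file-samples.tgz sample-v1 sample-v2 HOWTO-reproduce.txt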


Nicolas