Re: Resolving deltas dominates clone time

Martin Fick <mfick@xxxxxxxxxxxxxx> · Mon, 22 Apr 2019 14:21:40 -0600

On Friday, April 19, 2019 11:58:25 PM MDT Jeff King wrote:
> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
> > I have been thinking about this problem, and I suspect that this compute
> > time is actually spent doing SHA1 calculations, is that possible? Some
> > basic back of the envelope math and scripting seems to show that the repo
> > may actually contain about 2TB of data if you add up the size of all the
> > objects in the repo. Some quick research on the net seems to indicate
> > that we might be able to expect something around 500MB/s throughput on
> > computing SHA1s, does that seem reasonable? If I really have 2TB of data,
> > should it then take around 66mins to get the SHA1s for all that data?
> > Could my repo clone time really be dominated by SHA1 math?
> 
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
> 
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
> 
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects \
>     --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '

This has been running for a few hours now; I will update you with the results
when it's done.

> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).

I think it's mostly XML files in the 1-10MB range.

> So yes, if you really do have to hash 2TB of data, that's going to take
> a while.

I was hoping I was wrong. Unfortunately, I sense that this is not something we
are likely to improve with a better algorithm. Long term, the best way to
handle this is probably bup's rolling hash splitting, which would make this
much better (assuming it made the objects small enough). I think it is
interesting that this approach might end up being effective for more than just
large binary file repos. If I could get this repo into bup somehow, it could
show us whether this would drastically reduce the index-pack time.

> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower.

Makes sense.

> If you don't mind losing the collision-detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

I suspect at 3 threads, which seems to be the default?
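
For reference, here is the arithmetic behind the 66 minute and 1-2 minute
figures as a small sketch. The inputs are only the rough estimates from this
thread (about 2TB of object data, about 500MB/s of SHA1 throughput per core),
and I am assuming decimal units and perfect scaling across cores, which the
coarse locking almost certainly rules out:

```shell
#!/bin/sh
# Back-of-the-envelope estimate; every input is a rough figure from this
# thread, not a measurement.
total_mb=$((2 * 1000 * 1000))   # ~2TB of object data, decimal units assumed
rate_mb_s=500                   # assumed SHA1 throughput per core
cores=56

cpu_seconds=$((total_mb / rate_mb_s))
echo "single core: ${cpu_seconds}s (~$((cpu_seconds / 60)) min)"
# Ideal case only: assumes no lock contention at all.
echo "ideal ${cores}-way split: $((cpu_seconds / cores))s"
```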

I am running some index-pack tests to check the theory. I can already tell you
that the 56-thread version was much slower: it took 397m25.622s. I am running a
few other tests as well, but it will take a while to get an answer. Since
things take hours to test, I made a repo with a single branch (and the tags for
that branch) from this bigger repo using git init/git fetch. The single-branch
repo takes about 12s to clone, but index-pack takes around 14s with 3 threads.
Any ideas why it is slower than a clone?

Here are some thread times for the single branch case:

 Threads  Time
 56       49s
 12       34s
 5        20s
 4        15s
 3        14s
 2        17s
 1        30s

So 3 threads appears optimal in this case.
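
In case anyone wants to try the same comparison on their own pack, a loop
along these lines should work. It is only a sketch: bench_index_pack is a
made-up helper name, the pack path is a placeholder, the timing is whole
seconds from date +%s (not strictly POSIX, but widely available), and each run
indexes into a throwaway bare repo so the runs do not affect each other:

```shell
#!/bin/sh
# Time "git index-pack" on one pack file at several thread counts.
# Usage: bench_index_pack /path/to/pack-XXXX.pack 1 2 3 4 5 12 56
# (bench_index_pack is a hypothetical helper, not part of git.)
bench_index_pack() {
    pack=$1; shift
    [ -f "$pack" ] || { echo "no such pack: $pack" >&2; return 1; }
    for n in "$@"; do
        scratch=$(mktemp -d) || return 1
        git init --bare -q "$scratch/new-repo.git" || return 1
        start=$(date +%s)
        # The redirect is on the subshell, so a relative pack path is
        # opened before the cd. --stdin stores the pack and its index
        # in the scratch repo.
        (cd "$scratch/new-repo.git" &&
            git index-pack --threads="$n" --stdin >/dev/null) <"$pack"
        status=$?
        end=$(date +%s)
        rm -rf "$scratch"
        [ "$status" -eq 0 ] || return 1
        echo "$n threads: $((end - start))s"
    done
}
```

With the numbers in the table above, going from 1 thread to 3 is only about a
2x speedup (30s down to 14s), and everything past 3 threads gets slower.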

Perhaps the locking can be improved here to make threading more effective?

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.

Would it make sense to thread the "receiving pack" phase as well? I believe
that phase still takes 2 or 3 times longer than the I/O itself.

> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
> 
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
> 
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).

I will work on profiling soon, but I wanted to give an update now.

Thanks for the great feedback,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation