On Mon, Apr 22, 2019 at 02:21:40PM -0600, Martin Fick wrote: > > Try this (with a recent version of git; your v1.8.2.1 won't have > > --batch-all-objects): > > > > # count the on-disk size of all objects > > git cat-file --batch-all-objects --batch-check='%(objectsize) > > %(objectsize:disk)' | perl -alne ' > > $repo += $F[0]; > > $disk += $F[1]; > > END { print "$repo / $disk = ", $repo/$disk } > > ' > > This has been running for a few hours now, I will update you with results when > its done. Hours? I think something might be wrong. It takes 20s to run on linux.git. > > 250:1 isn't inconceivable if you have large blobs which have small > > changes to them (and at 8GB for 8 million objects, you probably do have > > some larger blobs, since the kernel is about 1/8th the size for the same > > number of objects). > > I think it's mostly xml files in the 1-10MB range. Yeah, I could believe that would do it, then. Take a 10MB file with a hundred 1K updates. That'd be 99*1K + 10MB to store, for almost 100:1 compression. The key is really having objects where the size of change versus the file size is small (so the marginal cost of each revision is small), and then having lots of changes (to give you a big multiplier). > > So yes, if you really do have to hash 2TB of data, that's going to take > > a while. > > I was hoping I was wrong. Unfortunately I sense that this is not likely > something we can improve with a better algorithm. It seems like the best way > to handle this long term is likely to use BUP's rolling hash splitting, it > would make this way better (assuming it made objects small enough). I think it > is interesting that this approach might end up being effective for more than > just large binary file repos. If I could get this repo into bup somehow, it > could potentially show us if this would drastically reduce the index-pack > time. It fundamentally can't help without changing Git's object model, since you need to have complete hashes of all of those files. If you're proposing to do rolling hashes to split blobs into multiple objects that would definitely work. But it wouldn't be compatible with Git anymore. > > If you don't mind losing the collision-detection, using openssl's sha1 > > might help. The delta resolution should be threaded, too. So in _theory_ > > you're using 66 minutes of CPU time, but that should only take 1-2 > > minutes on your 56-core machine. I don't know at what point you'd run > > into lock contention, though. The locking there is quite coarse. > > I suspect at 3 threads, seems like the default? Ah, right, I forgot we cap it at 3 (which was determined experimentally, and which we more or less attributed to lock contention as the bottleneck). I think you need to use $GIT_FORCE_THREADS to override it. > I am running some index packs to test the theory, I can tell you already that > the 56 thread versions was much slower, it took 397m25.622s. I am running a > few other tests also, but it will take a while to get an answer. Since things > take hours to test, I made a repo with a single branch (and the tags for that > branch) from this bigger repo using a git init/git fetch. The single branch > repo takes about 12s to clone, but it takes around 14s with 3 threads to run > index-pack, any ideas why it is slower than a clone? Are you running it in the same repo, or in another newly-created repo? Or alternatively, in a new repo but repeatedly running index-pack? After the first run, that repo will have all of the objects. And so for each object it sees, index-pack will say "woah, we already had that one; let's double check that they're byte for byte identical" which carries extra overhead (and probably makes the lock contention way worse, too, because accessing existing objects just has one big coarse lock). So definitely do something like: for threads in 1 2 3 4 5 12 56; do rm -rf repo.git git init --bare repo.git GIT_FORCE_THREADS=$threads \ git -C repo.git index-pack -v --stdin </path/to/pack done to test. > Perhaps the locking can be improved here to make threading more effective? Probably, but easier said than done, of course. > > We also hash non-deltas while we're receiving them over the network. > > That's accounted for in the "receiving pack" part of the progress meter. > > If the time looks to be going to "resolving deltas", then that should > > all be threaded. > > Would it make sense to make the receiving pack time also threaded because I > believe that time is still longer than the I/O time (2 or 3 times)? It's a lot harder to thread since we're eating the incoming bytes. And unless you're just coming from a local disk copy, the network is generally the bottleneck (and if you are coming from a local disk copy, then consider doing a local clone which will avoid all of this hashing in the first place). -Peff