On Sat, Apr 20 2019, Jeff King wrote:

> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
>
>> I have been thinking about this problem, and I suspect that this compute time
>> is actually spent doing SHA1 calculations, is that possible? Some basic back
>> of the envelope math and scripting seems to show that the repo may actually
>> contain about 2TB of data if you add up the size of all the objects in the
>> repo. Some quick research on the net seems to indicate that we might be able
>> to expect something around 500MB/s throughput on computing SHA1s, does that
>> seem reasonable? If I really have 2TB of data, should it then take around
>> 66mins to get the SHA1s for all that data? Could my repo clone time really be
>> dominated by SHA1 math?
>
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
>
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
>
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '
>
> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).
>
> So yes, if you really do have to hash 2TB of data, that's going to take
> a while. "openssl speed" on my machine gives per-second speeds of:
>
>   type        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>   sha1       135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k
>
> So it's faster on bigger chunks, but yeah, 500-1000MB/s seems like about
> the best you're going to do. And...
>
>> I mention 1.8.2.1 because we have many old machines which need this. However,
>> I also tested this with git v2.18 and it actually is much slower even
>> (~140mins).
>
> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower. Building with OPENSSL_SHA1 should be the fastest (and
> is what the numbers above are from). Git's internal (but not
> collision-detecting) BLK_SHA1 is somewhere in the middle.
>
>> Any advice on how to speed up cloning this repo, or what to pursue more
>> in my investigation?
>
> If you don't mind losing the collision detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

There's also my (been meaning to re-roll)
https://public-inbox.org/git/20181113201910.11518-1-avarab@xxxxxxxxx/
series, if *that* part of the SHA-1 checking is part of what's going on
here. It'll help a *tiny* bit, but of course it's part of the "trust
remote" risk management...

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.
>
> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
>
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
>
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).
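
If you do replay it like that, it's also a convenient place to try both
of the knobs mentioned above. A rough sketch, assuming you're building
git from source in a git.git checkout, and that the fresh clone has a
single pack-*.pack:

  # a build with the plain (non-collision-detecting) OpenSSL SHA-1,
  # instead of the DC_SHA1 default
  make OPENSSL_SHA1=YesPlease

  # replay just the index-pack step with varying thread counts (the
  # counts here are arbitrary examples), writing the *.idx to a
  # throwaway path, to see where the coarse locking stops scaling
  for n in 1 8 28 56
  do
      time ./git index-pack --threads=$n -o /tmp/replay.idx \
          "$old"/.git/objects/pack/pack-*.pack
  done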

> As far as avoiding that work altogether, there aren't a lot of options.
> Git clients do not trust the server, so the server sends only the raw
> data, and the client is responsible for computing the object ids. The
> only exception is a local filesystem clone, which will blindly copy or
> hardlink the .pack and .idx files from the source.
>
> In theory there could be a protocol extension to let the client say "I
> trust you, please send me the matching .idx that goes with this pack,
> and I'll assume there was no bitrot nor trickery on your part". I
> don't recall anybody ever discussing such a patch in the past, but I
> think Microsoft's VFS for Git project that backs development on Windows
> might do similar trickery under the hood.

I started to write:

  I wonder if there's room for some tacit client/server cooperation
  without such a protocol change, e.g. the server sending over a pack
  constructed in such a way that everything required for a checkout is
  at the beginning of the data. Now we implicitly tend to do it mostly
  the other way around, for delta optimization purposes. That would
  allow a smart client in a hurry to index-pack it as it goes along,
  and as soon as it has enough to check out HEAD, hand control back to
  the user and continue the rest in the background.

But I realized I was just starting to describe something like 'clone
--depth=1' followed by a 'fetch --unshallow' in the background, except
that would work better (if you did "just the tip" naïvely you'd get
'missing object' on e.g. 'git log'; with that ad-hoc hack we'd need to
write out two packs, etc.):

  $ rm -rf /tmp/git; time git clone --depth=1 https://chromium.googlesource.com/chromium/src /tmp/git; time git -C /tmp/git fetch --unshallow
  Cloning into '/tmp/git'...
  remote: Counting objects: 304839, done
  remote: Finding sources: 100% (304839/304839)
  remote: Total 304839 (delta 70483), reused 204837 (delta 70483)
  Receiving objects: 100% (304839/304839), 1.48 GiB | 19.87 MiB/s, done.
  Resolving deltas: 100% (70483/70483), done.
  Checking out files: 100% (302768/302768), done.

  real    2m10.223s
  user    1m2.434s
  sys     0m15.564s

[not waiting for that second bit, but it'll take ages...]

I think just having a clone mode that did that for you might scratch a
lot of people's itch, i.e. "I want full history, but mainly I want a
checkout right away, so background the full clone". But at this point
I'm just starting to describe some shoddy version of
Documentation/technical/partial-clone.txt :). OTOH there's no "narrow
clone and flesh it out right away" option.

On protocol extensions: just having a way to "wget" the corresponding
*.idx file from the server would be great, and would reduce clone times
by a lot. There's the risk of trusting the server, but most people's use
case is going to be pushing right back to the same server, which'll be
doing a full validation.

We could also defer that validation instead of skipping it, e.g. wget
*.{pack,idx} followed by a 'fsck' in the background. I've sometimes
wanted that anyway, i.e. a "fsck --auto" similar to "gc --auto", run
periodically to detect repository bitflips. Or do some "narrow"
validation of such an *.idx file right away, e.g. for all the
trees/blobs required for the current checkout, and background the rest.
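
To make that "grab the *.idx and validate later" part concrete, here's a
rough manual approximation of it. A sketch only: it assumes a server
that still exposes its objects/ directory over the dumb-HTTP layout, the
URL is made up, and setting up the refs from info/refs is left out:

  base=https://git.example.org/repo.git   # hypothetical dumb-HTTP remote
  git init --bare repo.git
  cd repo.git

  # take the server's word for it: copy a pack *and* its ready-made index
  pack=$(curl -fs "$base/objects/info/packs" | awk '/^P/ { print $2; exit }')
  curl -fs -o "objects/pack/$pack" "$base/objects/pack/$pack"
  curl -fs -o "objects/pack/${pack%.pack}.idx" "$base/objects/pack/${pack%.pack}.idx"

  # ...then defer the validation we skipped, instead of dropping it
  git fsck --full &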