On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > If you don't mind losing the collision-detection, using openssl's sha1
> > might help. The delta resolution should be threaded, too. So in _theory_
> > you're using 66 minutes of CPU time, but that should only take 1-2
> > minutes on your 56-core machine. I don't know at what point you'd run
> > into lock contention, though. The locking there is quite coarse.
>
> There's also my (been meaning to re-roll)
> https://public-inbox.org/git/20181113201910.11518-1-avarab@xxxxxxxxx/
> *that* part of the SHA-1 checking is part of what's going on here. It'll
> help a *tiny* bit, but of course is part of the "trust remote" risk
> management...

I think we're talking about two different collision detections, and your
patch wouldn't help at all here. Your patch optionally removes the "woah,
we got an object with a duplicate sha1, let's check that the bytes are
the same in both copies" check. But Martin's problem is a clone, so we
wouldn't have any existing objects to duplicate in the first place.

The problem in his case is literally just that the actual SHA-1 is
expensive, and that can be helped by using the optimized openssl
implementation rather than sha1dc (which checks not for collisions with
objects we _have_, but for evidence of somebody trying to exploit
weaknesses in sha1).

One thing we could do to make that easier is a run-time flag to switch
between sha1dc and a faster implementation (either openssl or blk_sha1,
depending on the build). That would let you flip the "trust" bit per
operation, rather than having it baked into your build.

(Note that the oft-discussed "use a faster sha1 implementation for
checksums, but sha1dc for object hashing" idea would not help here,
because these really are object hashes whose time is dominating. We only
have to checksum 8GB of raw packfile, but we have to hash 2TB of object
data.)
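(For concreteness, the "baked into your build" part is the family of
SHA-1 knobs in the Makefile. A minimal sketch, assuming the usual knob
names:

  # link against openssl's SHA-1 instead of the default collision-detecting
  # sha1dc; faster hashing, but no detection of attempted collision attacks
  make OPENSSL_SHA1=YesPlease

  # or use the bundled block SHA-1 implementation
  make BLK_SHA1=YesPlease

The run-time flag discussed above would presumably compile two
implementations into the same binary and pick between them per
operation.)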
> I started to write:
>
>     I wonder if there's room for some tacit client/server cooperation
>     without such a protocol change.
>
>     E.g. the server sending over a pack constructed in such a way that
>     everything required for a checkout is at the beginning of the
>     data. Now we implicitly tend to do it mostly the other way around
>     for delta optimization purposes.
>
>     That would allow a smart client in a hurry to index-pack it as they
>     go along, and as soon as they have enough to check out HEAD return
>     to the client, and continue the rest in the background.

Interesting idea. You're not reducing the total client effort, but you
are improving the latency of getting the user to a checkout. Of course
that doesn't help if they want to run "git log" as their first
operation. ;)

> But realized I was just starting to describe something like 'clone
> --depth=1' followed by a 'fetch --unshallow' in the background, except
> that would work better (if you did "just the tip" naïvely you'd get
> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
> write out two packs etc...).

Right, that would work. I will note one thing, though: the total time to
do a depth-1 clone followed by an unshallow is probably much higher than
doing the whole clone as one unit, for two reasons:

  1. The server won't use reachability bitmaps when serving the
     follow-up fetch (because shallowness invalidates the reachability
     data they're caching), so it will spend much more time in the
     "Counting objects" phase.

  2. The server has to throw away some deltas. Imagine version X of a
     file in the tip commit is stored as a delta against version Y in
     that commit's parent. The initial clone has to throw away the
     on-disk delta of X and send you the whole object (because you are
     not requesting Y at all). And then in the follow-up fetch, it must
     either send you Y as a base object (wasting bandwidth), or generate
     a delta from Y to X on the fly (wasting CPU).

> But at this point I'm just starting to describe some shoddy version of
> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
> clone and fleshen right away" option.

Yes. And partial-clone suffers from the problems above to an even
greater extent. ;)

> On protocol extensions: Just having a way to "wget" the corresponding
> *.idx file from the server would be great, and reduce clone times by a
> lot. There's the risk of trusting the server, but most people's
> use-case is going to be pushing right back to the same server, which'll
> be doing a full validation.

One tricky thing is that the server may be handing you a bespoke .pack
file, for which there is no matching ".idx" at all, neither in memory
nor on disk. And you would not want the whole on-disk .pack/.idx pair
from a site like GitHub, where there are objects from many forks.

So in general, I think you'd need some cooperation from the server side:
a way to ask it to generate and send the .idx that matches the .pack it
is sending you. Or even if not the .idx format itself, some stable list
of sha1s that you could use to reproduce it without hashing each
uncompressed byte yourself. This could even be stuffed into the pack
format and stripped out by the receiving index-pack (i.e., each entry is
prefixed with "and by the way, here is my sha1...").

> We could also defer that validation instead of skipping it. E.g. wget
> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
> periodically to detect repository bitflips.
>
> Or, do some "narrow" validation of such an *.idx file right away.
> E.g. for all the trees/blobs required for the current checkout, and
> background the rest.

The "do we have all of the objects we need" check is already separate
from "figure out the sha1 of each object", so I think you'd get that
naturally if you just took in an untrusted .idx. (It also demonstrates
that any .idx cost is really focused on blobs, because the "do we have
all objects" check is going to decompress every commit and tree in the
repo anyway.)

-Peff