On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@xxxxxxxxx/
>> *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do that check anyway, even on a clone, as reported by
Geert at @amazon.com preceding that patch of mine. But it's 99.99%
irrelevant to *performance* in this case after the loose object cache
you added (before that cache it could make all the difference,
depending on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoia/validation steps the user
might be willing to optionally drop just at "clone" time.

> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.
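E.g. something like this (the Makefile knob names here are from memory,
so double-check them, and "core.sha1Implementation" is a name I've just
made up, no such config exists yet):

    # build-time selection, today: sha1dc is the default, or you can
    # link against OpenSSL's SHA-1 instead
    make DC_SHA1=YesPlease
    make OPENSSL_SHA1=YesPlease

    # the hypothetical run-time switch: trust this one clone enough to
    # skip collision detection, without baking that into the build
    git -c core.sha1Implementation=openssl clone https://example.com/huge.git

That way you'd build one binary with both implementations linked in,
keep sha1dc as the default, and only flip the fast one on for bulk
operations where you've decided to trust the remote.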
> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background.
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone time, but about the clone-to-repo-mostly-usable time.

>   1. The server won't use reachability bitmaps when serving the
>      follow-up fetch (because shallowness invalidates the reachability
>      data they're caching), so it will spend much more time in the
>      "Counting objects" phase.
>
>   2. The server has to throw away some deltas. Imagine version X of a
>      file in the tip commit is stored as a delta against version Y in
>      that commit's parent. The initial clone has to throw away the
>      on-disk delta of X and send you the whole object (because you are
>      not requesting Y at all). And then in the follow-up fetch, it must
>      either send you Y as a base object (wasting bandwidth), or it must
>      on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@xxxxxxxxxxxxxxxxxxx/) it
could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
around for such a purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.
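To make that concrete, the client end of what I'm imagining would look
something like the below. It's all hypothetical: the URLs are made up,
and the server would also need to advertise the tip commit the base
pack was cut at ($BASE_TIP here) so the follow-up fetch knows we
already have it:

    # grab the server's pre-generated "base" pack and its .idx, trusting
    # them as-is instead of re-hashing 2TB of object data; the files must
    # keep their matching pack-*.{pack,idx} names for git to find them
    git init huge && cd huge
    wget -P .git/objects/pack \
        https://example.com/base/pack-$BASE.pack \
        https://example.com/base/pack-$BASE.idx

    # a ref at the advertised tip makes the fetch negotiation report it
    # as a "have", so we only transfer what's newer than the base pack
    git update-ref refs/base "$BASE_TIP"
    git remote add origin https://example.com/repo.git
    git fetch origin

None of that needs new wire protocol on the client side, the "protocol"
is just an agreed-upon place to put the *.{pack,idx} and the tip they
correspond to.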
> > The "do we have all of the objects we need" is already separate from > "figure out the sha1 of each object", so I think you'd get that > naturally if you just took in an untrusted .idx (it also demonstrates > that any .idx cost is really focusing on blobs, because the "do we have > all objects" check is going to decompress every commit and tree in the > repo anyway). > > -Peff