On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Your patch is optionally removing the "woah, we got an object with a > > duplicate sha1, let's check that the bytes are the same in both copies" > > check. But Martin's problem is a clone, so we wouldn't have any existing > > objects to duplicate in the first place. > > Right, but we do anyway, as reported by Geert at @amazon.com preceding > that patch of mine. But it is 99.99% irrelevant to *performance* in this > case after the loose object cache you added (but before that could make > all the difference depending on the FS). I scratched my head at this a bit. If we don't have any other objects, then what are we comparing against? But I think you mean that we have the overhead of doing the object lookups to find that out. Yes, that can add up if your filesystem has high latency, but I think in this case it is a drop in the bucket compared to dealing with the actual object data. > I just mentioned it to plant a flag on another bit of the code where > index-pack in general has certain paranoias/validation the user might be > willing to optionally drop just at "clone" time. Yeah, I agree it may be worth pursuing independently. I just don't think it will help Martin's case in any noticeable way. > > Right, that would work. I will note one thing, though: the total time to > > do a 1-depth clone followed by an unshallow is probably much higher than > > doing the whole clone as one unit, for two reasons: > > Indeed. The hypothesis is that the user doesn't really care about the > clone-time, but the clone-to-repo-mostly-usable time. There was a little bit of self-interest in there for me, too, as a server operator. While it does add to the end-to-end time, most of the resource use for the shallow fetch gets put on the server. IOW, I don't think we'd be happy to see clients doing this depth-1-and-then-unshallow strategy for every clone. > > So in general, I think you'd need some cooperation from the server side > > to ask it to generate and send the .idx that matches the .pack it is > > sending you. Or even if not the .idx format itself, some stable list of > > sha1s that you could use to reproduce it without hashing each > > uncompressed byte yourself. > > Yeah, depending on how jt/fetch-cdn-offload is designed (see my > https://public-inbox.org/git/87k1hv6eel.fsf@xxxxxxxxxxxxxxxxxxx/) it > could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx} > around for such a purpose. > > So in such a case you'd serve up that recent-enough *.{pack,idx} for the > client to "wget", and the client would then trust it (or not) and do the > equivalent of a "fetch" from that point to be 100% up-to-date. I think it's sort of orthogonal. Either way you have to teach the client how to get a .pack/.idx combo. Whether it learns to receive it inline from the first fetch, or whether it is taught to expect it from the out-of-band fetch, most of the challenge is the same. > > This could even be stuffed into the pack format and stripped out by > > the receiving index-pack (i.e., each entry is prefixed with "and by > > the way, here is my sha1..."). > > That would be really interesting. I.e. just having room for that (or > anything else) in the pack format. > > I wonder if it could be added to the delta-chain in the current format > as a nasty hack :) There's definitely not "room" in any sense of the word in the pack format. :) However, as long as all parties agreed, we can stick whatever we want into the on-the-wire format. So I was imagining something more like: 1. pack-objects learns a --report-object-id option that sticks some additional bytes before each object (in its simplest form, $obj_hash bytes of id) 2. likewise, index-pack learns a --parse-object-id option to receive it and skip hashing the object bytes 3. we get a new protocol capability, "send-object-ids". If the server advertises and the client requests it, then both sides turn on the appropriate option You could even imagine generalizing it to "--report-object-metadata", and including 0 or more metadata packets before each object. With object id being one, but possibly other computable bits like "generation number" after that. I'm not convinced other metadata is worth the space/time tradeoff, though. After all, this is stuff that the client _could_ generate and cache themselves, so you're trading off bandwidth to save the client from doing the computation. Anyway, food for thought. :) -Peff