On Thu, Jan 25, 2018 at 12:06:59AM +0100, Ævar Arnfjörð Bjarmason wrote:

> >> Has anyone here barked up this tree before? Suggestions? Tips on where
> >> to start hacking the repack code to accomplish this would be most
> >> welcome.
> >
> > Does this overlap with the desire to have resumable clones? I'm
> > curious what would happen if you did the same experiment with two
> > separate clones of git/git, cloned one right after the other so that
> > hopefully the upstream git/git didn't receive any updates between your
> > two separate clones. (In other words, how much do packfiles differ in
> > practice for different packings of the same data?)
>
> If you clone git/git from GitHub twice in a row you get the exact same
> pack, and AFAICT this is true of git in general (but may change between
> versions).

That's definitely not guaranteed. It _tends_ to be the case over the
short term because we use --threads=1 on the server. But it may differ
if:

  - we repack on the server, which we do based on pushes

  - somebody pushes, even to another fork. The exact results depend on
    the packs in which we find the objects, and a new push may duplicate
    some existing objects but with a different representation (e.g., a
    different delta base).

I'm actually interested in adding an etags-like protocol extension that
would work something like this:

  - the server says "here's a pack, and its opaque tag is XYZ"

  - on resume, the client says "can I resume the pack with tag XYZ?"

  - the server then decides whether its on-disk state is sufficient for
    it to agree to recreate XYZ (e.g., the number and identity of
    packs). If yes, it resumes. If no, it says "nope" and the two sides
    go through a normal fetch again.

The important thing is that the tag is opaque to the client. So a stock
implementation could use the on-disk state to decide. But a server could
choose to cache the packs it sends for a period of time (especially if
the client hangs up before we've sent the whole thing).
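The resume decision in the steps above could be sketched roughly like
this. To be clear, this is a hypothetical illustration, not the real git
protocol: the tag derivation, function names, and return values are all
invented, and a real server might hash more state than just pack names.

```python
import hashlib

def pack_tag(pack_names):
    """Derive an opaque tag from the server's on-disk pack state.

    Hypothetical scheme: hash the sorted list of pack names. The client
    never interprets this value; it only echoes it back on resume.
    """
    state = "\0".join(sorted(pack_names))
    return hashlib.sha1(state.encode()).hexdigest()

def handle_resume(client_tag, current_pack_names):
    """Decide whether the server agrees to recreate the pack the client
    started downloading, based solely on its current on-disk state."""
    if client_tag == pack_tag(current_pack_names):
        return "resume"   # identical pack can be regenerated
    return "nope"         # fall back to a normal, full fetch

# Example: a repack or a push changes the pack list, so the tag no
# longer matches and the server refuses to resume.
before = ["pack-1234.pack", "pack-5678.pack"]
tag = pack_tag(before)
print(handle_resume(tag, before))                       # -> resume
print(handle_resume(tag, before + ["pack-9abc.pack"]))  # -> nope
```

Because the tag is opaque, a fancier server could swap in any scheme it
likes (e.g., a cache key for a retained pack) without the client
changing at all.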
We already do this to a limited degree at GitHub in order to efficiently
serve multiple clients simultaneously fetching the same pack (e.g.,
imagine a fleet of AWS machines all triggering "git fetch" at once).

I think that's a tangent to what you're looking for in this thread,
though.

-Peff