On Wed, Feb 01, 2017 at 10:06:15AM -0800, Junio C Hamano wrote:

> > If you _can_ do that latter part, and you take "I only care about
> > resumability" to the simplest extreme, you'd probably end up with a
> > protocol more like:
> >
> >   Client: I need a packfile with this want/have
> >   Server: OK, here it is; its opaque id is XYZ.
> >   ... connection interrupted ...
> >   Client: It's me again. I have up to byte N of pack XYZ
> >   Server: OK, resuming
> >           [or: I don't have XYZ anymore; start from scratch]
> >
> > Then generating XYZ and generating that bundle are basically the
> > same task.
>
> The above allows a simple and naive implementation of generating a
> packstream and "tee"ing it to a spool file to be kept while sending
> to the first client that asks XYZ.
>
> The story I heard from folks who run git servers at work for Android
> and other projects, however, is that they rarely see two requests
> with want/have that result in an identical XYZ, unless "have" is an
> empty set (aka "clone"). In a busy repository, between two clone
> requests relatively close together, somebody would be pushing, so
> you'd need many XYZs in your spool even if you want to support only
> the "clone" case.

Yeah, I agree a tag "XYZ" does not cover all cases, especially for
fetches. We do caching at GitHub based on the sha1(want+have+options)
tag, and it does catch quite a lot of parallelism, but not all. It
catches most clones, and many fetches that are done by "thundering
herds" of similar clients.

One thing you could do with such a pure "resume XYZ" tag is to
represent the generated pack _without_ replicating the actual object
bytes, but take shortcuts by basing particular bits on the on-disk
packfile. Just enough to serve a deterministic packfile for the same
want/have bits. For instance, if the server knew that XYZ meant:

  - send bytes m through n of packfile p, then...

  - send the object at position i of packfile p, as a delta against
    the object at position j of packfile q

  - ...and so on

Then you could store very small "instruction sheets" for each XYZ
that rely on the data in the packfiles. If those packfiles go away
(e.g., due to a repack), that invalidates all of your current XYZ
tags. That's OK as long as this is an optimization, not a correctness
requirement.

I haven't actually built anything like this, though, so I don't have
a complete language for the instruction sheets, nor numbers on how
big they would be for average cases.

> So in real life, I think that the exchange needs to be more like
> this:
>
>     C: I need a packfile with this want/have
>     ... C/S negotiate what "have"s are common ...
>     S: Sorry, but our negotiation indicates that you are way too
>        behind. I'll send you a packfile that brings you up to a
>        slightly older set of "want", so pretend that you asked for
>        these slightly older "want"s instead. The opaque id of that
>        packfile is XYZ. After getting XYZ, come back to me with
>        your original set of "want"s. You would give me more recent
>        "have" in that request.
>     ... connection interrupted ...
>     C: It's me again. I have up to byte N of pack XYZ
>     S: OK, resuming (or: I do not have it anymore, start from scratch)
>     ... after 0 or more iterations C fully receives and digests XYZ ...
>
> and then the above will iterate until the server does not have to
> say "Sorry but you are way too behind" and returns a packfile
> without having to tweak the "want".

Yes, I think that is a reasonable variant.
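
To make the resume step in that exchange a bit more concrete, here is
a minimal sketch of what the server side of "I have up to byte N of
pack XYZ" could look like if XYZ were simply a pack spooled to disk,
as in the naive tee-to-a-spool-file implementation above. None of this
is actual git code; the spool directory, the file naming, and the
resume_pack() helper are all made up for illustration:

  /*
   * Hypothetical and illustrative only -- not real git code.
   * Assume each opaque pack id maps to a spooled pack under SPOOL_DIR.
   */
  #include <stdio.h>
  #include <stdlib.h>

  #define SPOOL_DIR "/var/cache/git-spool"

  /*
   * Stream the spooled pack "id" to "out", starting at byte "offset".
   * Returns 0 on success, -1 if the spool is gone and the client has
   * to start from scratch.
   */
  static int resume_pack(const char *id, long offset, FILE *out)
  {
      char path[4096];
      char buf[8192];
      size_t n;
      FILE *in;

      snprintf(path, sizeof(path), "%s/%s.pack", SPOOL_DIR, id);
      in = fopen(path, "rb");
      if (!in)
          return -1;                /* spool expired or pruned */
      if (fseek(in, offset, SEEK_SET) != 0) {
          fclose(in);
          return -1;
      }
      while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
          fwrite(buf, 1, n, out);
      fclose(in);
      return 0;
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: resume-pack <id> <byte-offset>\n");
          return 1;
      }
      if (resume_pack(argv[1], strtol(argv[2], NULL, 10), stdout) < 0) {
          /* the "I don't have XYZ anymore; start from scratch" case */
          fprintf(stderr, "pack %s not available; restart\n", argv[1]);
          return 2;
      }
      return 0;
  }

The only interesting decision there is the failure mode: once the
spool (or, in the instruction-sheet variant, the packfiles it points
into) has gone away, the server's only honest answer is "start from
scratch".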
The client knows about seeding, but the XYZ conversation continues to
happen inside the git protocol. So it loses flexibility versus a true
CDN redirection, but it would "just work" when the server and client
both understand the feature, without the server admin having to set
up a separate bundle-over-http infrastructure.

-Peff