On Mon, Oct 31, 2011 at 04:07:18AM -0500, Jonathan Nieder wrote:

> Something like Jeff's "priming the well with a server-specified
> bundle" proposal[2] might be a good way to make the same trick
> transparent to clients in the future.

Yes, that is one of the use cases I hope to address. But it will
require the publisher specifying a mirror location (it's possible we
could add some kind of automagic "hit a bundler service first" config
option, though I fear that the existing small-time bundler services
would crumble under the load). So in the general case (and in the
meantime), you may have to learn to manually prime the repo using a
bundle (the manual steps are sketched below).

I haven't started on the patches for communicating mirror sites
between the server and client, but I did just write some patches to
handle "git fetch http://host/path/to/file.bundle" automatically,
which is the first step. They need a few finishing touches and some
testing, though.

> Even with that, later fetches, which grab a pack generated on the fly
> to only contain the objects not already fetched, are generally not
> resumable. Overcoming that would presumably require larger protocol
> changes, and I don't know of anyone working on it. (My workaround
> when in a setup where this mattered was to use the old-fashioned
> "dumb" http protocol. It worked fine.)

My goal was for the mirror communication between client and server to
be something like:

  - if you don't have object XXXXXX, then prime with URL
    http://host/bundle1

  - if you don't have object YYYYYY, then prime with URL
    http://host/bundle2

and so forth (there's a rough client-side sketch below). A cloning
client would grab the first bundle, then the second, and then hit the
real repo via the git protocol. A client who had previously cloned
might have XXX, but would now grab bundle2, and then hit the real
repo. So depending on how often the server side feels like creating
new bundles, you would get most of the changes via bundles, and then
only be getting a small number of objects via git.

The downside of cumulative fetching is that the bundles can only serve
well-known checkpoints. So if you have a timeline like this:

  t0: server publishes bundle/mirror config with one line (the XXX
      bit above)

  t1: you clone, getting the whole bundle. No waste, because you had
      nothing in the first place, and you needed everything.

  t2: you fetch again, getting N commits worth of history via the git
      protocol

  t3: server decides a lot of new objects (let's say M commits worth)
      have accumulated, and generates a new line (the YYY line)

  t4: you fetch, see that you don't yet have YYY, and grab the second
      bundle

But at t4 you grabbed a bundle containing M commits when you already
had the first N of them. So you actually wasted bandwidth getting
objects you already had. The only benefit is that you grabbed a
static file, which is resumable.

So I suspect there is some black magic involved in deciding when to
create a new bundle, and at what tip. If you create a bundle once a
month, but include only commits up to a week ago, then people pulling
weekly will never grab the bundle, but people pulling less frequently
will get most of the month as a bundle (a server-side sketch is
below).

A secondary issue is that in a scheme like this, your mirror list
will grow without bound. So you'd want to periodically repack
everything into a single bundle. But then people who are fetching
wouldn't want that, as it is just an exacerbated version of the same
problem above.

Which is all a roundabout way of saying that the git protocol is
really the sane way to do efficient transfers.
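To make the mirror-list idea concrete, here is a rough client-side
sketch. The one-object-one-URL list format and the loop are invented
for illustration; the real pieces are "git cat-file -e", which tests
whether an object exists locally, and the bundle-over-http fetching
from the patches mentioned above:

  # hypothetical mirror list, one "<object> <url>" pair per line:
  #
  #   XXXXXX http://host/bundle1
  #   YYYYYY http://host/bundle2
  #
  while read object url; do
          if ! git cat-file -e "$object" 2>/dev/null; then
                  # a bundle is a static file: resumable, cacheable
                  git fetch "$url"
          fi
  done <mirror-list

  # then top up whatever is left over the git protocol
  git fetch git://host/repo.git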
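On the server side, generating the bundles could be as simple as a
cron job. The checkpoint policy here (once a month, bundle everything
up to the most recent tag on master) is made up; choosing the tip so
that frequent fetchers skip the bundle is exactly the black magic in
question:

  # run from cron once a month: bundle all history up to a
  # well-known checkpoint (here, the last tag reachable from
  # master) and publish it as a static file
  checkpoint=$(git describe --abbrev=0 master) &&
  git bundle create /var/www/repo.bundle "$checkpoint"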
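And until the automation exists, manually priming a clone from a
published bundle already works with stock git. The URL and file names
here are made up, and I'm assuming the publisher created the bundle
with something like "git bundle create repo.bundle --all" so that it
contains HEAD:

  # the bundle is a static file, so the download is resumable
  wget -c http://host/repo.bundle

  # clone from the bundle, then point origin at the real repo
  # and fetch whatever happened after the bundle was made
  git clone repo.bundle repo
  cd repo
  git remote set-url origin git://host/repo.git
  git fetch origin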
An alternative, much simpler scheme would be for the server to just
say:

  - if you have nothing, then prime with URL http://host/bundle

And then _only_ clone would bother with checking mirrors. People
doing fetch would be expected to do it often enough that not being
resumable isn't a big deal.

-Peff