Martin Fick <mfick@xxxxxxxxxxxxxx> writes:

> Android consists of approximately 400 projects, and if
> anyone tags their builds regularly, that means that they
> will be tagging 400 projects per build.  We have something
> like 10K tags on average across 400 projects, so when we do
> a simple No Op sync just to see if all 400 projects are up
> to date, this yields about 200MB of data over the wire (4M
> refs)!!!
> ...
> As you can imagine, we really would like to see this
> eliminated from our workflows.  If we want to check 400
> projects to see if they are up to date, it should be 400
> refs, not 400 * 10K.

I think we can ignore the 400 part above for now.  As these are independent "projects", it is unreasonable to expect any solution to add cost less than linearly as you add more projects.  But it should be possible to make the cost of update discovery comparable between the case where both ends have seen 10K existing refs and one ref got updated, and the case where you started from 20K common refs.  The latter does not have to cost twice as much (or four times, if an inefficient approach is quadratic).

I think the assumptions in your build-bot scenario need to be clarified a bit to give context.  I am assuming that when you say "see if projects are up to date", you mean your build-bot is interested in a single branch in each repository, possibly together with new tags that have become reachable from the updated tip of that branch (if the branch tip moved).  I also assume that the build-bot polls very often, which is why 200MB (or, divided by 400, about 0.5MB, which is still a lot for a single repository) hurts so much.

Everything below is what I think you already know; I am giving an overview of _my_ understanding of the issue for the benefit of others (and to have you sanity-check whether I misunderstood your pain points).
Even an attempt to optimize "can we skip updating?" by asking "ls-remote" about a single branch is futile with the current protocol: we expect the server to respond with the full ref advertisement, and we filter the result on our end.  In the upload-pack protocol, the serving side talks first, and what it says is "here are all the refs I have and their values".  The side that asks for objects has no way to tell the server that it is only interested in a subset, even if it wanted to.  Unlike the capability exchange that happens after the initial connection, there is no gap in the protocol in which the side that initiates the connection could tell the serving side to skip or delay the initial ref advertisement.

What the above means is that the transition has to keep the current upload-pack as-is, and a new protocol has to be invoked via a different program ("upload-pack-2" or something needs to be taught to git-daemon and also invoked by ssh), even once an updated protocol is designed.  An updated connection initiator (i.e. ls-remote and fetch) may be able to try upload-pack-2 first and fall back to upload-pack after getting a failure, but afaik nobody has figured out if this is doable, by checking whether the failure signal that comes from currently deployed software is detailed enough for our side to tell if the failure is because the other end does not support upload-pack-2, or because of some other problem (e.g. auth failed, no such repo, etc.).  Regardless of auto-fallback, an updated connection initiator needs a configuration switch to be told explicitly which protocol to talk with which remote.

What exact protocol "upload-pack-2" talks may be an interesting subject on its own.  It may still have the serving side talk first (probably showing its capabilities and the tips of branches, if there are not too many of them), or it may instead let the connection initiator talk first and ask for a specific set of refs.
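The auto-fallback idea, seen from the client side, could be sketched as follows.  Note that "git-upload-pack-2" is entirely hypothetical (no such program exists); --upload-pack is fetch's existing option for naming the program to run on the remote end, and the toy "upstream"/"clone" repositories exist only to make the sketch self-contained:

```shell
# Toy upstream and a clone of it, so "origin" points somewhere real.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init
git clone -q upstream clone
cd clone

# Hypothetical auto-fallback: try the (nonexistent) new service first.
if git fetch -q --upload-pack=git-upload-pack-2 origin 2>/dev/null
then
    echo "spoke the new protocol"
else
    # All the client sees here is "it failed"; it cannot distinguish
    # "the server has no upload-pack-2" from auth failure, a missing
    # repo, etc., so blindly falling back can mask real errors.
    git fetch -q origin && echo "fell back to classic upload-pack"
fi
```

This illustrates the detection problem described above: the failure from the first attempt carries no structured reason, which is exactly why an explicit per-remote configuration switch is needed regardless.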
But that is of lesser importance from the high-level point of view, and I won't go into the details in this thread.