On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

> > Yes. If you know that the receiver has commit X, and you want to know
> > if it has some blob Y, the only way to know for sure is to look at
> > every tree of every commit reachable from X, and see whether any of
> > them references Y.
>
> Jeff, in my original example, I did a cherry-pick from origin/somebranch.

Sorry, I thought we were talking about the general case, not your
specific example.

> Even without asking, we can assume with great probability that
> origin/somebranch is available at origin.

Bear in mind that the transfer process does not know about cherry-picking
at all. It only sees the other side's tips and traverses from there.

> And the file in question happens to reside in the tree at the very tip
> of origin/somebranch, not in some of its ancestors. In this case,
> there's no need to search the history at all. And even in this pretty
> simple case, the algorithm seems to fail for some reason.

Correct. And in the current code, we should be looking at the tip tree
for your case. However, the usual reason to do so is to mark those
objects as a "preferred base" in pack-objects for doing deltas. I wonder
if we are not correctly noticing the case where an object is both
requested to be sent and marked as a preferred base (in which case we
should drop it from our sending list).

If that is the problem, it should be easy to fix cheaply. Such a fix
would not help the general case, but it would cover your specific
example, and since it costs nothing, there is no reason not to do it.
I'll see if I can investigate using the example script you posted.

> > Yes, you do not have to recurse into sub-trees (or commits) you have
> > already seen. And we already do that optimization.
>
> Why is the file re-transmitted, then?

I meant "we do the optimization during history traversal that avoids
going into sub-trees we have already seen". We do _not_ do the full
history traversal for a partial push.

> With a little change in the protocol, a very simple optimization could
> be implemented, avoiding the complicated bitmap strategy you were
> talking about:
> [...]
> In the last step, instead of sending a packfile, the sending side
> should send a list of the SHA's which would be included in this
> packfile. The receiving side would then be able to request all the
> objects it needs to get up-to-date.

I think you have a mis-impression of the problem bitmaps are trying to
solve, mostly because it is not the problem you presented in your thread
(though your problem is one that bitmaps can help with).

Consider what the sending side has to do to come up with that list of
objects to send. It has to traverse history, looking at each tree of
each commit that is going to be sent. That effort is proportional to the
amount of history we are going to send. For a small push or fetch, it is
not much. For a clone, it can be quite a lot (tens of seconds of CPU
time per clone). Bitmaps drastically reduce the amount of CPU required.

Bitmaps can _also_ solve other problems, like letting us be more
thorough in realizing which objects the other side has (without spending
effort on an expensive traversal). If that were the only thing they did,
it might not be worth it. But we basically get that for "free" by
solving the other problem.

So I do not think such a protocol extension is an argument against pack
bitmaps; you would want them with or without the protocol change.
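To make the difference concrete, here is a toy, self-contained sketch
(Python, with made-up object names; this illustrates the idea, not
git's actual code or on-disk format):

  # Toy object store: commits, trees and blobs are just dict entries.
  # In real git the traversal reads packed objects, which is what
  # makes it expensive.
  COMMITS = {  # commit -> (tree, parents)
      "c3": ("t3", ["c2"]),
      "c2": ("t2", ["c1"]),
      "c1": ("t1", []),
  }
  TREES = {  # tree -> blobs it references
      "t3": {"blobA", "blobB"},
      "t2": {"blobA"},
      "t1": {"blobA", "blobC"},
  }

  def has_blob_by_traversal(tip, blob):
      """Open every tree of every commit reachable from tip; the
      work grows with the amount of history."""
      seen, stack = set(), [tip]
      while stack:
          commit = stack.pop()
          if commit in seen:
              continue
          seen.add(commit)
          tree, parents = COMMITS[commit]
          if blob in TREES[tree]:
              return True
          stack.extend(parents)
      return False

  # With reachability bitmaps the same answer is precomputed: one bit
  # per object in the pack, per (selected) commit. Lookup is a single
  # bit test; no trees are opened at all.
  OBJ_BIT = {"blobA": 0, "blobB": 1, "blobC": 2}
  BITMAPS = {"c3": 0b111, "c2": 0b101, "c1": 0b101}

  def has_blob_by_bitmap(tip, blob):
      return (BITMAPS[tip] >> OBJ_BIT[blob]) & 1 == 1

  assert has_blob_by_traversal("c3", "blobC")
  assert has_blob_by_bitmap("c3", "blobC")

The traversal's cost grows with the history being sent; the bitmap
answer is a precomputed bit test (or an OR of a few bitmaps when there
are several tips), which is where the CPU savings come from.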
> I think this change would be considerably simpler than the
> reachability bitmap you are talking about. And it would avoid all
> those time consuming traversals through the history and the tree. And
> it would omit _all_ redundant retransmissions. Even in the case when
> sender and receiver do not have _any_ common heads at all, _no_ files
> at all would be retransmitted unnecessarily.

Yes, that would be nice. However, in the common cases it would make
things much worse. A clone of linux.git has ~3.5M objects. At 20 bytes
per binary SHA-1, that is 70MB of sha1s for the server to tell the
client "tell me which of these you need", and then another 70MB for the
client to say "yep, I need all of them".

You could have the client instead say "here are the few I _don't_
need", which would save the second half in the common cases. And of
course the list would be smaller for a smaller fetch or push. Just
looking at "git rev-list --objects --all --since=1.week.ago" in the
kernel, there are ~77K new objects. So over the course of a week, we
would use an extra ~1.5MB of bandwidth.

How many objects did we save ourselves from sending, and how big were
they (keep in mind they will typically be deltas against objects you
already have anyway)? The answer depends on your cherry-picking and
reverting habits, but I would guess that in a normal workflow you would
not come close to breaking even.

There are, of course, cases where the user _knows_ there is a huge
object that the other side has, and they do not want to have to send it
again. For that case, it would be useful to have something like the
protocol you described as an optional extension to turn on. Of course,
somebody has to implement it. :)
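For what it's worth, here is a toy sketch of that optional exchange,
along with the bandwidth arithmetic from above (Python; the message
shapes and names are invented for illustration, since no such protocol
extension exists yet):

  SHA1_BYTES = 20  # a binary SHA-1 is 160 bits

  # the advertised-list cost, using the numbers above
  print(3500000 * SHA1_BYTES)  # 70000000 bytes, ~70MB each way on a clone
  print(77000 * SHA1_BYTES)    # 1540000 bytes, ~1.5MB for a week of kernel work

  def server_offer(pack_object_ids):
      # instead of streaming the packfile, first advertise its
      # object list
      return set(pack_object_ids)

  def client_request(offered, already_have):
      # ask only for what is missing, e.g. skipping a huge blob
      # the client already has
      return offered - already_have

  needed = client_request(server_offer({"obj1", "obj2", "hugeblob"}),
                          already_have={"hugeblob"})
  assert needed == {"obj1", "obj2"}

-Peff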