[+cc people who were interested in earlier iterations of this topic]

On Sat, Jun 11, 2011 at 07:04:07PM +0000, Ævar Arnfjörð Bjarmason wrote:

> This is a resubmission of Jeff King's patch series to speed up git tag
> --contains with some changes. It's been cooking for a while as:

Thanks for resurrecting this. I've been meaning to look at it again, and somehow an entire year has passed. I've tried to refresh my memory on the issues, so hopefully I can make coherent comments.

There have been a few responses in the meantime, and one I want to address is Junio's:

  http://article.gmane.org/gmane.comp.version-control.git/152765

The major points in it are (I'm paraphrasing for brevity, but please correct me if I'm misrepresenting):

  1. A depth-first algorithm has the problem of going down to the roots unnecessarily.

Yes, and that's why with my initial patch, "tag --contains" is on the same order of time as "git rev-list --tags >/dev/null" (in the worst case). But with the early return based on commit timestamp, we stop looking down uninteresting paths early (there's a small sketch of what I mean below, after point 3). So the downside isn't speed, but trusting commit timestamps.

  2. One solution is to use merge-bases to find earlier cutoff points.

It's possible. But doesn't the merge-base algorithm, like all of the commit walking, rely somewhat on commit timestamps, too? For example, this thread shows issues with revision limiting:

  http://thread.gmane.org/gmane.comp.version-control.git/72274

I'm not sure about the merge-base algorithm itself, though. In the face of skew, I think it can go down a non-optimal path (e.g., going all the way to the root because one branch has commits skewed to look much older than they really are, which pushes them to the back of the commit_list priority queue). But I think it will still find the correct answer.

Two problems with doing a merge-base solution are:

  a. It can still end up hitting the roots, or close to them. The merge-base of something recent and a tag from years ago is going to have to go through years of history. So a timestamp cutoff is really nice for knowing that there's no point in digging further. (On the other hand, searching for something from years ago with respect to recent tags will always have to dig through all of that history; I think that is less common than the other way around, though.)

  b. If you are doing the merge-base over many tags at once, it's hard to figure out which source tag is actually responsible for hitting the merge base.

Which leads us to Junio's final point:

  3. You can do something like show-branch does, and smudge each commit with a bitfield that has one bit per tag (e.g., using the object flags).

My problem with this is that it doesn't scale algorithmically with many tags. If we have a constant number of bits, that only reduces the number of traversals we have to do by a constant factor. Our constant using the flags field would be 27. Reducing the time by a factor of 27 is nice, but I suspect something like the 10K-tags example is still going to be painful if even one of the 27 tags handled in each traversal has to dig far into history.

Another option is to trade space for time: do one traversal, but actually keep a large enough bitfield. For 10K tags, that's about 1K per commit. So for git.git, that's 30M; for linux-2.6, it's 250M, which is getting pretty big. But remember that's an insane number of tags, and we can also move the slider between time and space (e.g., 5 traversals of 50M each).
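Just so we're all picturing the same thing for point 1, here is the rough shape of that walk in toy C. This is not the actual patch; "struct commit", its fields, and "contains" are stand-ins purely to show where the timestamp cutoff short-circuits the descent:

  /*
   * Toy sketch only: a depth-first "does tip contain needle?" check that
   * gives up on any path once it reaches commits older than the needle
   * (minus whatever clock skew we are willing to tolerate).
   */
  #include <stddef.h>
  #include <time.h>

  struct commit {
      time_t date;              /* committer timestamp */
      unsigned flags;           /* for the SEEN marker below */
      size_t nr_parents;
      struct commit **parents;
  };

  #define SEEN 1u

  /* cutoff = needle's timestamp minus the skew we are willing to trust */
  static int contains(struct commit *tip, const struct commit *needle,
                      time_t cutoff)
  {
      if (tip == needle)
          return 1;
      if (tip->flags & SEEN)
          return 0;             /* already ruled out on another path */
      tip->flags |= SEEN;
      if (tip->date < cutoff)
          return 0;             /* trusting timestamps: too old to reach needle */
      for (size_t i = 0; i < tip->nr_parents; i++)
          if (contains(tip->parents[i], needle, cutoff))
              return 1;
      return 0;
  }

A real implementation would also cache what it learns across tags rather than starting fresh for each one, and would avoid unbounded recursion; the cutoff is the part that keeps us out of ancient history.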
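And to make that space/time tradeoff concrete, the one-pass version would look roughly like this (again toy C with stand-in names, not git's actual flags machinery): paint every commit reachable from tag i with bit i, and afterwards the bits on a commit tell you exactly which tags contain it.

  /*
   * Toy sketch only: show-branch-style painting with one bit per tag.
   * With enough bits this is a single pass over history, at a cost of
   * roughly nr_tags/8 bytes per commit.
   */
  #include <stddef.h>

  struct commit {
      size_t nr_parents;
      struct commit **parents;
      unsigned char *tag_bits;  /* nr_tags/8 bytes (rounded up), zeroed */
  };

  static int has_bit(const struct commit *c, unsigned tag)
  {
      return c->tag_bits[tag / 8] & (1u << (tag % 8));
  }

  /* Depth-first paint; stops as soon as a commit already carries the bit. */
  static void paint(struct commit *c, unsigned tag)
  {
      if (has_bit(c, tag))
          return;
      c->tag_bits[tag / 8] |= 1u << (tag % 8);
      for (size_t i = 0; i < c->nr_parents; i++)
          paint(c->parents[i], tag);
  }

  /* "tag --contains x": tag T contains x iff T's bit got painted onto x. */
  static int tag_contains(const struct commit *x, unsigned tag)
  {
      return has_bit(x, tag);
  }

The 5-traversals-of-50M variant is the same thing with a smaller tag_bits array, re-zeroed between passes.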
> I've moved the relevant code from commit.[ch] to builtin/tag.c as
> Junio's comment suggested. So IMO the "tag: speed up --contains
> calculation" patch is ready to be applied.

The only downside to that is that the code is harder to reuse in "branch --contains", which could also benefit. I think the multiple merge-base traversals tend not to be as bad there, because branch tips tend to stay recent, and you tend to ask about recent commits. So even though we dig through the same commits multiple times, it all stays in recent history, whereas tags tend to point to very old things. Still, that is dependent on your repo setup, including the number of branches and how stale they tend to be. It would be nice if we could always be fast.

> The next two patches look OK to me, but they need some documentation
> for the core.clockskew variable, which perhaps should be renamed to
> tag.clockskew, or was the plan to use it for other things in the
> future?

It was intended to be used elsewhere. I have a patch to use it in name-rev, which currently just has a hard-coded skew.

The problem with a skew variable like this is that you really don't want to set it higher than a day or so, because it affects all parts of the traversal, not just the parts near the skewed commits. In linux-2.6, for example, the worst skew is about 100 days. Here are timings for "git tag --contains HEAD~200" with various core.clockskew values:

  - no clock skew tolerated: 0.035s
  - 1 day:                   0.034s
  - 100 days:                0.252s
  - infinite:                5.373s

So we are almost an order of magnitude slower for having set core.clockskew high enough to cover this repo's actual skew. And that's only for 100 days; some projects have skew on the order of years.

If we can assume that the skewed commits are relatively rare[1], we might do better to mark individual skewed commits via notes or the replace mechanism. A simple test shows that doing notes lookups is not too expensive:

  # pretend we have some fake timestamps
  for i in 20 40 60; do
    git notes add -m "fake timestamp" HEAD~$i
  done

  (best of 5)
  $ time git log --pretty=raw --no-notes >/dev/null
  real    0m3.868s
  user    0m3.796s
  sys     0m0.060s

  (best of 5)
  $ time git log --pretty=raw --show-notes >/dev/null
  real    0m3.878s
  user    0m3.812s
  sys     0m0.052s

Any code wanting to limit the traversal would then check the notes to see whether a commit's timestamp is valid (in fact, we would do even fewer lookups than "--show-notes" does here, since we only need to check for a bogus timestamp at the edges of our traversal).

The replace mechanism could be used instead; it has the advantage that we wouldn't even need to change the traversal code, which would simply see corrected objects with the right timestamps, and it has similar performance characteristics.

[1] The numbers from Jonathan and Clemens show that in most repos the number of skewed commits tends to be small: usually single digits, sometimes dozens, but far fewer than the total number of commits.

> Is the "Why is "git tag --contains" so slow?" utility something we
> want?

As it is now, I don't think so. Tweaking core.clockskew is slow, as shown above, and it's not something people should have to do manually. I have a version, which I'll post in a minute, that instead fills in a notes tree with the sha1s of commits that have bogus timestamps; that tree can then be consulted accurately and automatically.

However, if we're going to have a look-aside cache of metadata on each commit, maybe it is really time to stop thinking about commit timestamps and start thinking about "generation numbers". The concept has been brought up on the list before; it's basically:

  1. Root commits have generation = 0.

  2. Other commits have generation = 1 + max(generations of their parents).

So the generation number strictly increases from ancestor to descendant, and if generation(X) > generation(Y), then X cannot possibly be an ancestor of Y.
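Computing them for existing history is just a memoized walk over the parents; roughly this, in toy C again ("struct commit", its "generation" field, and GENERATION_UNSET are stand-ins for wherever a notes-cache or the commit object would keep the value):

  /*
   * Toy sketch only: roots end up with generation 0, everything else
   * with 1 + the maximum generation among its parents.
   */
  #include <stddef.h>

  #define GENERATION_UNSET ((unsigned long)-1)

  struct commit {
      size_t nr_parents;
      struct commit **parents;
      unsigned long generation;     /* initialized to GENERATION_UNSET */
  };

  static unsigned long generation(struct commit *c)
  {
      if (c->generation != GENERATION_UNSET)
          return c->generation;     /* already computed */

      unsigned long gen = 0;        /* stays 0 for root commits */
      for (size_t i = 0; i < c->nr_parents; i++) {
          unsigned long pg = generation(c->parents[i]) + 1;
          if (pg > gen)
              gen = pg;
      }
      return c->generation = gen;
  }

(A real implementation would iterate rather than recurse over a million commits, but that's a detail.)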
If this were stored in the commit object, we could use it for all traversals instead of the commit timestamp, and it would presumably be more reliable (you could still have a bogus repo, of course, but it would come from a bug in git, not from importing old history or having your clock set wrong).

The problem is that existing objects don't have this generation number. It's easy to calculate, though, and we could in theory use a notes-cache to store it externally. Obviously the complexity and performance aren't going to be as good as if it were just in the commit object, but we're sadly 6 years too late to make that decision.

-Peff