On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote: > 2012/11/11 Jeff King <peff@xxxxxxxx>: > > On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote: > > > > Ultimately, I have some ideas for doing this in a breadth-first way, > > which would make it more naturally iterative. It would involve having N > > bits of storage per commit to check N tags, but it would mean that we > > could get accurate answers in the face of clock skew (like the > > merge-base calculation, it would merely get slower in the face of skew). > > I guess the optimal algorithm may also depend on the commit graph > general shape, but intuitively, I'd say that the critical factor is > the number and distribution of tags. As soon as you have a significant > number of tags (let's say 1% of the commits are tagged, evenly > distributed), you'll quickly end up with every commit marked as > containing or not the target commit, so that each additional tag check > is cheap. > > This suggests a complexity of O(number of commits) more often then > not, however you choose to traverse the graph. We can do much better than O(number of commits), though, if we stop traversing down a path when its timestamp shows that it is too old to contain the commits we are searching for. The problem is that the timestamps cannot always be trusted, because they are generated on machines with wrong clocks, or by buggy software. This could be solved by calculating and caching a "generation" number, but last time it was discussed there was a lot of arguing and nothing got done. Another approach, used by the merge-base calculation, is to treat parents in a breadth-first way, but sort them by timestamp, and walk backwards to find the merge-base (and if you get to a merge-base, you can stop). In that case, bad timestamps may cause you to look at extra commits (because you process a commit prematurely and end up going deeper than the merge base), but it can never give you a wrong answer. Thinking on it more, though, the merge-base style of computation would mean you always have to walk backwards to the oldest tag. Which is in the same order of magnitude as the number of commits, assuming you have tags near the start of history. So I think we will always want to keep a cutoff, anyway, and there is not much point in switching off of a depth-first approach (but of course it does make sense to use iteration instead of recursion to do so). > At least on my almost-real-life repo*, with ~140,000 commits and > ~2,000 tags, tag --contains takes a few seconds, of course more than > the worst-case test on a 40,000 commit / 1 tag repo, but still in the > same order of magnitude. Try "git rev-list --count --all" to get a sense of how long O(# of commits) takes. Before the depth-first implementation, "tag --contains" was O(commits * tags) in the worst case. With depth first, it's O(commits), and then with the timestamp cutoff, it's really O(commits since needle), where "needle" is the oldest thing you're looking for. Here are those numbers on linux-2.6 (with linux-stable tags): [to get a sense of the repo size] $ git for-each-ref refs/tags | wc -l 909 $ git rev-list --count --all 363413 [this is O(commits)] $ time git rev-list --count --all real 0m4.034s user 0m3.960s sys 0m0.056s [before depth-first, ffc4b80^; you can see that it is much worse than O(commits), though not as bad as the worst case (because finding the merge bases for recent tags is not O(commits)] $ time git tag --contains HEAD~200 real 0m42.838s user 0m42.527s sys 0m0.156s [after depth-first, ffc4b80; you can see that this is O(commits), because we will go depth-first down to the roots, but do only a single traversal] $ time git tag --contains HEAD~200 real 0m3.939s user 0m3.784s sys 0m0.140s [with my generation patches; much faster] $ time git tag --contains HEAD~200 real 0m0.037s user 0m0.020s sys 0m0.012s I was thinking we had merged the timestamp cutoff (which has the same performance characteristics as generations, just with the skew issue) to master, but it looks like we didn't. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html