Re: [PATCH] git tag --contains : avoid stack overflow

On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote:

> 2012/11/11 Jeff King <peff@xxxxxxxx>:
> > On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
> >
> > Ultimately, I have some ideas for doing this in a breadth-first way,
> > which would make it more naturally iterative. It would involve having N
> > bits of storage per commit to check N tags, but it would mean that we
> > could get accurate answers in the face of clock skew (like the
> > merge-base calculation, it would merely get slower in the face of skew).
> 
> I guess the optimal algorithm may also depend on the commit graph
> general shape, but intuitively, I'd say that the critical factor is
> the number and distribution of tags. As soon as you have a significant
> number of tags (let's say 1% of the commits are tagged, evenly
> distributed), you'll quickly end up with every commit marked as
> containing the target commit or not, so that each additional tag check
> is cheap.
> 
> This suggests a complexity of O(number of commits) more often than
> not, however you choose to traverse the graph.

We can do much better than O(number of commits), though, if we stop
traversing down a path when its timestamp shows that it is too old to
contain the commits we are searching for. The problem is that the
timestamps cannot always be trusted, because they are generated on
machines with wrong clocks, or by buggy software. This could be solved
by calculating and caching a "generation" number, but last time it was
discussed there was a lot of arguing and nothing got done.
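To illustrate the generation-number idea, here is a hypothetical Python sketch (not git's C code; the commit graph is just a dict of parent lists, and the function names are made up). The key property is that a generation number is monotone along ancestry, so, unlike a timestamp, it gives a cutoff that can always be trusted: if gen(c) <= gen(needle), then c cannot contain needle.

```python
from collections import deque

def generation_numbers(parents):
    """parents: dict mapping commit -> list of parent commits.
    Roots get generation 1; otherwise 1 + max over parents."""
    children = {c: [] for c in parents}
    indeg = {c: len(parents[c]) for c in parents}
    for c, ps in parents.items():
        for p in ps:
            children[p].append(c)
    gen = {}
    # process in topological order (Kahn's algorithm, parents first)
    q = deque(c for c, d in indeg.items() if d == 0)
    while q:
        c = q.popleft()
        gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
        for ch in children[c]:
            indeg[ch] -= 1
            if indeg[ch] == 0:
                q.append(ch)
    return gen

def tag_contains_gen(tip, needle, parents, gen):
    """Does tip contain needle?  Walk tip's ancestry, pruning any
    commit whose generation number rules out containing needle."""
    stack = [tip]
    seen = set()
    while stack:
        c = stack.pop()
        if c == needle:
            return True
        if c in seen or gen[c] <= gen[needle]:
            continue  # too shallow to have needle as an ancestor
        seen.add(c)
        stack.extend(parents[c])
    return False
```

Unlike the timestamp cutoff below, this prune is exact: no skew allowance is needed, because generation numbers are computed from the graph shape, not from clocks.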

Another approach, used by the merge-base calculation, is to treat
parents in a breadth-first way, but sort them by timestamp, and walk
backwards to find the merge-base (and if you get to a merge-base, you
can stop). In that case, bad timestamps may cause you to look at extra
commits (because you process a commit prematurely and end up going
deeper than the merge base), but it can never give you a wrong answer.
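As a rough sketch of that merge-base style of walk (hypothetical Python, simplified from git's actual C algorithm): keep a priority queue ordered newest-first by timestamp, propagate "reachable from tip" / "reachable from needle" flags into parents, and stop once everything still queued already carries the needle flag, since ancestors of the needle cannot lead back to it. A skewed timestamp only changes the order in which commits are expanded, so it costs extra work but never a wrong answer.

```python
import heapq

def tag_contains_mb(tip, needle, parents, ts):
    """Does tip contain needle?  ts maps commit -> committer timestamp.
    Walk both starting points newest-commit-first, OR-ing reachability
    flags into each parent, merge-base style."""
    TIP, NEEDLE = 1, 2
    if tip == needle:
        return True
    flags = {tip: TIP, needle: NEEDLE}
    # heapq is a min-heap, so negate timestamps to pop newest first
    heap = [(-ts[tip], tip), (-ts[needle], needle)]
    queued = {tip, needle}
    while heap:
        # stop when every queued commit is already known to be an
        # ancestor of needle (an O(heap) check, kept simple here)
        if all(flags[c] & NEEDLE for _, c in heap):
            break
        _, c = heapq.heappop(heap)
        queued.discard(c)
        for p in parents[c]:
            old = flags.get(p, 0)
            new = old | flags[c]
            if p == needle and new & TIP:
                return True  # the walk from tip reached needle
            if new != old:
                flags[p] = new
                if p not in queued:
                    heapq.heappush(heap, (-ts[p], p))
                    queued.add(p)
    return False
```

Note that the stopping condition here depends only on the flags, not on the timestamps themselves, which is why bad clocks degrade performance rather than correctness.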

Thinking on it more, though, the merge-base style of computation would
mean you always have to walk backwards to the oldest tag, which is on
the same order as the number of commits, assuming you have tags near
the start of history. So I think we will always want to keep a cutoff
anyway, and there is not much point in switching away from the
depth-first approach (though of course it does make sense to use
iteration instead of recursion to implement it).
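That combination, an iterative depth-first walk with a timestamp cutoff, could look something like this (again a hypothetical Python sketch, not git's C implementation; the skew allowance is illustrative). The explicit stack is what avoids the stack overflow from the patch subject, and the cutoff is what can go wrong when clocks are skewed:

```python
def tag_contains_dfs(tip, needle, parents, ts, skew=0):
    """Does tip contain needle?  Iterative depth-first walk with an
    explicit stack (deep histories cannot overflow the call stack) and
    a timestamp cutoff: anything older than the needle, minus a skew
    allowance, is assumed not to contain it.  With badly skewed clocks
    and too small a skew value this can miss a real answer."""
    mark = {}  # commit -> True/False: does it contain needle?
    stack = [tip]
    while stack:
        c = stack[-1]
        if c in mark:
            stack.pop()
        elif c == needle:
            mark[c] = True
            stack.pop()
        elif ts[c] < ts[needle] - skew:
            mark[c] = False  # timestamp cutoff: too old to contain needle
            stack.pop()
        else:
            pending = [p for p in parents[c] if p not in mark]
            if pending:
                stack.extend(pending)  # visit parents before deciding c
            else:
                mark[c] = any(mark[p] for p in parents[c])
                stack.pop()
    return mark[tip]
```

Because `mark` is shared across the whole walk, checking many tags against the same needle reuses earlier answers, which is the effect described in the quoted message above.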

> At least on my almost-real-life repo*, with ~140,000 commits and
> ~2,000 tags, tag --contains takes a few seconds, of course more than
> the worst-case test on a 40,000 commit / 1 tag repo, but still in the
> same order of magnitude.

Try "git rev-list --count --all" to get a sense of how long O(# of
commits) takes. Before the depth-first implementation, "tag --contains"
was O(commits * tags) in the worst case. With depth first, it's
O(commits), and then with the timestamp cutoff, it's really O(commits
since needle), where "needle" is the oldest thing you're looking for.
Here are those numbers on linux-2.6 (with linux-stable tags):

  [to get a sense of the repo size]
  $ git for-each-ref refs/tags | wc -l
  909
  $ git rev-list --count --all
  363413

  [this is O(commits)]
  $ time git rev-list --count --all
  real    0m4.034s
  user    0m3.960s
  sys     0m0.056s

  [before depth-first, ffc4b80^; you can see that it is much worse than
   O(commits), though not as bad as the worst case (because finding the
   merge bases for recent tags is not O(commits))]
  $ time git tag --contains HEAD~200
  real    0m42.838s
  user    0m42.527s
  sys     0m0.156s

  [after depth-first, ffc4b80; you can see that this is O(commits),
   because we will go depth-first down to the roots, but do only a
   single traversal]
  $ time git tag --contains HEAD~200
  real    0m3.939s
  user    0m3.784s
  sys     0m0.140s

  [with my generation patches; much faster]
  $ time git tag --contains HEAD~200
  real    0m0.037s
  user    0m0.020s
  sys     0m0.012s

I was thinking we had merged the timestamp cutoff (which has the same
performance characteristics as generations, just with the skew issue) to
master, but it looks like we didn't.

-Peff

