Re: Why is "git tag --contains" so slow?

On Mon, Jul 05, 2010 at 10:10:12AM -0400, tytso@xxxxxxx wrote:

> As time progresses, the clock skew breakage should be less likely to
> be hit by a typical developer, right?  That is, unless you are
> specifically referencing one of the commits which were skewed, two
> years from now, the chances of someone (who isn't doing code
> archeology) getting hit by a problem should be small, right?  This

It's not about directly referencing skewed commits. It's about
traversing history that contains skewed commits. So if I have a history
like:

  A -- B -- C -- D

and "B" is skewed, then I will generally give up on finding "A" when
searching backwards from "C" or "D", or their descendants. So as time
moves forward, you will continue to have your old tags pointing to "C"
or "D", but also tags pointing to their descendants. Doing "git tag
--contains A" will continue to be inaccurate, since it will continue to
look for "A" from "C" and "D", but also from newer tags, all of which
involve traversing the skewed "B".
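To make the failure mode concrete, here is a toy sketch of a date-based cutoff traversal over that history. This is not git's actual implementation; the commit names, timestamps, and the `slack` parameter are all invented for illustration:

```python
# Linear history A -- B -- C -- D, with parent pointers and commit
# timestamps. "B" is skewed: its timestamp claims it predates "A".
commits = {
    "A": {"parent": None, "time": 1000},
    "B": {"parent": "A",  "time": 10},    # skewed: "older" than its parent
    "C": {"parent": "B",  "time": 2000},
    "D": {"parent": "C",  "time": 3000},
}

def contains(tip, target, slack=0):
    """Walk parents from `tip` looking for `target`, but give up once a
    commit's date falls more than `slack` seconds below the target's
    date (the cutoff heuristic under discussion)."""
    cutoff = commits[target]["time"] - slack
    cur = tip
    while cur is not None:
        if cur == target:
            return True
        if commits[cur]["time"] < cutoff:
            return False  # looks "too old" to contain target; stop early
        cur = commits[cur]["parent"]
    return False

# With no slack, the walk from "D" hits the skewed "B" (time 10, below
# the cutoff of 1000) and wrongly concludes "D" does not contain "A".
# A slack large enough to cover the skew restores the right answer.
print(contains("D", "A", slack=0))     # False: gives up at "B"
print(contains("D", "A", slack=1000))  # True
```

Traversing from "C" to find "B" or "B" to find "A" still works, which is why only queries that must cross the skewed commit are affected.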

What I think is true is that people will be less likely to look at "A"
as time goes on, as code it introduced presumably becomes less relevant
(either bugs are shaken out, or it gets replaced, or whatever). And
obviously looking at "C" from "D", the skew in "B" will be irrelevant.

So I think typical developers become less likely to hit the issue as
time goes on, but software archaeologists will hit it forever.

> If so, I could imagine the automagic scheme choosing a default that
> only finds the worst skew in the past N months.  This would speed
> things up for users who are using repositories that have skews in the
> distant past, at the cost of introducing potentially confusing edge
> cases for people doing code archeology.

How do you decide, when looking for commits that have bogus timestamps,
which ones happened in the past N months? Certainly you can do some
statistical analysis to pick out anomalous ones. And you could perhaps
favor future skew over past skew, since future skew doesn't tend to
impact traversal cutoffs (and large past skewing seems to be more
common). But that is getting kind of complex.
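As a rough sketch of what that analysis might look like (an invented helper, not anything git ships): walking a linear history's timestamps from newest to oldest, the slop a cutoff would need is the largest amount by which an ancestor's timestamp exceeds the smallest timestamp seen among its descendants:

```python
def max_past_skew(times_newest_first):
    """Given commit timestamps ordered newest (tip) to oldest (root),
    return the largest amount by which an ancestor appears *newer* than
    one of its descendants -- i.e. the slop a date cutoff would need to
    traverse this history correctly."""
    worst = 0
    min_seen = float("inf")  # smallest descendant timestamp so far
    for t in times_newest_first:
        if t > min_seen:
            # an ancestor dated after a descendant: skew of (t - min_seen)
            worst = max(worst, t - min_seen)
        min_seen = min(min_seen, t)
    return worst

# Using the A -- B -- C -- D example (B skewed to time 10): walking
# D, C, B, A gives timestamps 3000, 2000, 10, 1000, and the cutoff
# would need 990 seconds of slop to reach "A" past the skewed "B".
print(max_past_skew([3000, 2000, 10, 1000]))  # 990
```

Picking "the past N months" out of that is the hard part: a skewed timestamp gives you no trustworthy notion of when the commit actually happened, which is what makes the scheme complex.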

> I'm not sure this is a good tradeoff, but given in practice how rarely
> most developers go back in time more than say, 12-24 months, maybe
> it's worth doing.  What do you think?

I'm not sure. I am tempted to just default it to 86400 and go no
further.  People who care about archaeology can turn off traversal
cutoffs if they like, and as the skewed history ages, people get less
likely to look at it. We could also pick half a year or some similarly
high number as the default allowable skew. The performance increase is still quite
noticeable there, and it covers the only large skew we know about. I'd
be curious to see if other projects have skew, and how much.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
