On 3/3/2018 12:15 AM, Jeff King wrote:
> On Fri, Jan 12, 2018 at 10:56:00AM -0800, csilvers wrote:
>
> > > This is a resubmission of Jeff King's patch series to speed up git tag
> > > --contains with some changes. It's been cooking for a while as:
> >
> > Replying to this 6-year-old thread:
> >
> > Is there any chance this could be resurrected? We are using
> > phabricator, which uses `git branch --contains` as part of its
> > workflow. Our repo has ~1000 branches on it, and the contains
> > operation is eating up all our CPU (and time). It would be very
> > helpful to us to make this faster!
> >
> > (The original thread is at
> > https://public-inbox.org/git/E1OU82h-0001xY-3b@xxxxxxxxxxxxxxxxx/)
>
> Sorry, this got thrown on my "to respond" pile and languished.

Thanks for adding me to the thread. It's good to know the pain points
people are hitting around commit graph walks.

> There are actually three things that make "git branch --contains" slow.
>
> First, if you're filtering 1000 branches, we'll run 1000 merge-base
> traversals, which may walk over the same commits multiple times.
>
> These days "tag --contains" uses a different algorithm that can look at
> all heads in a single traversal. But the downside is that it's
> depth-first, so it tends to walk down to the roots. That's generally OK
> for tags, since you often have ancient tags that mean getting close to
> the roots anyway.
>
> But for branches, they're more likely to be recent, and you can get away
> without going very deep into the history.
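
To make sure we're picturing the same thing, here's a minimal sketch of
that shared traversal (my own toy model, not git's actual structs, and
the real walk uses an explicit stack rather than recursion): the
depth-first walk memoizes its yes/no verdict in per-commit flags, so
every later branch tip reuses the work done for the earlier ones.

#include <stdio.h>

/* Toy commit, just enough to illustrate the idea. */
struct commit {
	unsigned flags;
	int nparents;
	struct commit **parents;
};

#define CONTAINS_YES 1u	/* target known reachable from here */
#define CONTAINS_NO  2u	/* target known unreachable from here */

/*
 * Depth-first reachability with memoized answers: the verdict is
 * cached in each commit's flags, so asking "--contains <target>"
 * from 1000 branch tips shares one walk over common history
 * instead of repeating it 1000 times.
 */
static int contains(struct commit *c, struct commit *target)
{
	int i;

	if (c == target || (c->flags & CONTAINS_YES))
		return 1;
	if (c->flags & CONTAINS_NO)
		return 0;
	for (i = 0; i < c->nparents; i++) {
		if (contains(c->parents[i], target)) {
			c->flags |= CONTAINS_YES;
			return 1;
		}
	}
	c->flags |= CONTAINS_NO;
	return 0;
}

int main(void)
{
	/* History: root <- mid <- tip1, and root <- tip2. */
	struct commit root = { 0, 0, NULL };
	struct commit *root_parent[] = { &root };
	struct commit mid = { 0, 1, root_parent };
	struct commit *mid_parent[] = { &mid };
	struct commit tip1 = { 0, 1, mid_parent };
	struct commit tip2 = { 0, 1, root_parent };

	printf("tip1 contains root: %d\n", contains(&tip1, &root));
	/* Reuses the flags set during the first walk. */
	printf("tip2 contains root: %d\n", contains(&tip2, &root));
	return 0;
}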

> So it's a tradeoff. There's no run-time switch to flip between them, but
> a patch like this:
>
> diff --git a/builtin/branch.c b/builtin/branch.c
> index 8dcc2ed058..4d674e86d5 100644
> --- a/builtin/branch.c
> +++ b/builtin/branch.c
> @@ -404,6 +404,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sorting)
>  	memset(&array, 0, sizeof(array));
>  
> +	filter->with_commit_tag_algo = 1;
>  	filter_refs(&array, filter, filter->kind | FILTER_REFS_INCLUDE_BROKEN);
>  
>  	if (filter->verbose)
>
> drops my run of "git branch -a --contains HEAD~100" from 8.6s to
> 0.4s on a repo with ~1800 branches. That sounds good, but on a repo with
> a smaller number of branches, we may actually end up slower (because we
> dig further down in history, and don't benefit from the multiple-branch
> speedup).

It's good to know that we already have an algorithm for the multi-head
approach. Things like `git branch -vv` are harder to tease out because
the graph walk is called by the line-format code.

> I tried to do a "best of both" algorithm in:
>
>   https://public-inbox.org/git/20140625233429.GA20457@xxxxxxxxxxxxxxxxxxxxx/
>
> which finds arbitrary numbers of merge bases in a single traversal. It
> did seem to work, but I felt uneasy about some of the corner cases.
> I've been meaning to revisit it, but obviously have never gotten around
> to it.
>
> The second slow thing is that during the traversal we load each commit
> object from disk. The solution there is to keep the parent information
> in a faster cache. I had a few proposals over the years, but I won't
> even bother to dig them up, because there's quite recent and promising
> work in this area from Derrick Stolee:
>
>   https://public-inbox.org/git/1519698787-190494-1-git-send-email-dstolee@xxxxxxxxxxxxx/
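
For anyone who hasn't read that series: the core idea is to keep parent
information in a flat, pre-parsed structure so the walk never has to
open and parse commit objects. A minimal sketch of what such a cache
could look like (my simplification, not the on-disk format from the
patches; among other things, octopus merges need an overflow list):

#include <stdint.h>
#include <stdio.h>

#define NO_PARENT UINT32_MAX

/*
 * Toy parent cache: commit i's parents are stored as positions in
 * these same arrays, so following a parent edge is an array load
 * instead of a disk read plus object parse.
 */
struct parent_cache {
	uint32_t nr;
	const uint32_t *parent1;	/* NO_PARENT if absent */
	const uint32_t *parent2;	/* NO_PARENT if absent */
};

static int parent_count(const struct parent_cache *g, uint32_t pos)
{
	return (g->parent1[pos] != NO_PARENT) +
	       (g->parent2[pos] != NO_PARENT);
}

int main(void)
{
	/* Positions: 0 = root, 1 = child of 0, 2 = merge of 0 and 1. */
	const uint32_t p1[] = { NO_PARENT, 0, 0 };
	const uint32_t p2[] = { NO_PARENT, NO_PARENT, 1 };
	struct parent_cache g = { 3, p1, p2 };

	printf("commit 2 has %d parents\n", parent_count(&g, 2));
	return 0;
}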

> And finally, the thing that the patches you linked are referencing is
> about using commit timestamps as a proxy for generation numbers. And
> Stolee's patches actually leave room for real, trustable generation
> numbers.
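
For anyone unfamiliar with the term: a commit's generation number is one
more than the largest generation number among its parents, with
parentless commits at generation 1. The property that makes it
trustworthy is that a commit can only reach commits of strictly smaller
generation; timestamps only approximate that, since clocks can be
skewed. A naive recursive version, just to pin down the definition (the
real series would compute this while writing the graph file, not by
recursing):

#include <stdint.h>
#include <stdio.h>

struct commit {
	uint32_t generation;	/* 0 = not yet computed */
	int nparents;
	struct commit **parents;
};

/*
 * gen(C) = 1               if C has no parents
 * gen(C) = 1 + max gen(P)  over each parent P of C
 *
 * Hence if B is a proper ancestor of A, gen(A) > gen(B).
 */
static uint32_t generation(struct commit *c)
{
	uint32_t max = 0;
	int i;

	if (c->generation)
		return c->generation;
	for (i = 0; i < c->nparents; i++) {
		uint32_t g = generation(c->parents[i]);
		if (g > max)
			max = g;
	}
	c->generation = max + 1;
	return c->generation;
}

int main(void)
{
	struct commit root = { 0, 0, NULL };
	struct commit *root_parent[] = { &root };
	struct commit child = { 0, 1, root_parent };

	printf("gen(root) = %u, gen(child) = %u\n",
	       (unsigned)generation(&root), (unsigned)generation(&child));
	return 0;
}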

> Once we have the serialized commit graph and generation numbers, I think
> the final step would just be to teach the "tag --contains" algorithm to
> stop walking down unproductive lines of history. And in fact, I think we
> can forget about the best-of-both multi-tip merge-base idea entirely.
> Because if you can use the generation numbers to avoid going too deep,
> then a depth-first approach is fine. And we'd just want to flip
> git-branch over to using that algorithm by default.
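
Concretely, I'd picture that as a one-line early-out in a walk like the
contains() toy above (again my sketch of the idea, not an actual patch;
it assumes the commit struct carries both the flags and the generation
fields from the earlier snippets):

#include <stdint.h>

#define CONTAINS_YES 1u
#define CONTAINS_NO  2u

struct commit {
	unsigned flags;
	uint32_t generation;
	int nparents;
	struct commit **parents;
};

/*
 * Like the memoized walk above, but refuse to descend below the
 * target's generation: a commit with gen < gen(target) cannot have
 * the target as an ancestor.
 */
static int contains_cutoff(struct commit *c, struct commit *target)
{
	int i;

	if (c == target)
		return 1;
	if (c->generation < target->generation)
		return 0;	/* prune: too shallow to reach target */
	if (c->flags & CONTAINS_YES)
		return 1;
	if (c->flags & CONTAINS_NO)
		return 0;
	for (i = 0; i < c->nparents; i++) {
		if (contains_cutoff(c->parents[i], target)) {
			c->flags |= CONTAINS_YES;
			return 1;
		}
	}
	c->flags |= CONTAINS_NO;
	return 0;
}

For a recent target like HEAD~100, gen(target) is large, so nearly all
of history sits below the cutoff and the depth-first walk stays shallow,
which is exactly why the multi-tip merge-base machinery becomes
unnecessary.
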
I'll keep this in mind as a target for performance measurements in the
serialized commit graph patch and the following generation number patch.
Thanks,
-Stolee