Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

On Mon, May 21, 2012 at 01:45:25PM -0400, Jeff King wrote:

> > I don't plan to work on this, but I thought I would point it out in
> > case it is causing somebody pain.
> 
> I'll clean up the patch and make one for the filter_refs case, too.

Here are the patches. I tested these with three repositories:

  - the rails/rails network repo, with ~400K refs

  - the git/git network repo, with ~220K refs

  - a fake repo made from git.git, with 3 numbered refs pointing at each
    commit for a total of ~100K refs (I didn't have a real dataset of
    that size).

Before these patches, I never managed to complete a "git clone --mirror
file://$repo" on the first two, as I got impatient after 5-10 minutes
and killed the process. The third one completed in ~110s.

  [1/5]: fetch-pack: sort incoming heads
  [2/5]: fetch-pack: avoid quadratic behavior in remove_duplicates

After this one, the "fake" case was down to ~63s.
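
To illustrate the idea behind these two (with illustrative names and
types, not the actual patch): once the incoming heads are sorted, any
duplicates are adjacent, so a single pass removes them in O(N) instead
of comparing every pair:

  #include <string.h>

  /* Sketch only: assumes "heads" is already sorted. */
  static int dedup_sorted(const char **heads, int nr)
  {
          int src, dst;

          if (nr < 2)
                  return nr;
          for (src = dst = 1; src < nr; src++) {
                  if (strcmp(heads[src], heads[dst - 1]))
                          heads[dst++] = heads[src];
          }
          return dst; /* number of unique heads */
  }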

  [3/5]: add sorting infrastructure for list refs
  [4/5]: fetch-pack: sort the list of incoming refs
  [5/5]: fetch-pack: avoid quadratic loop in filter_refs

And after this one, the "fake" case was down to ~32s. Notably, most of
the remaining time is spent on the actual object transfer (i.e., it
used to hang before even getting to "Counting objects...", but now it
reaches that point almost immediately). Perf corroborates this, showing
most of the time in zlib inflation and delta resolution.
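
To sketch the filter_refs change (again with illustrative types and
names, not the code from the patch): with both the ref list and the
sought names sorted, filtering becomes a single merge-like walk over
the two lists instead of a nested loop:

  #include <string.h>

  struct item { const char *name; struct item *next; };

  /* Sketch only; a real version would also free the dropped nodes. */
  static struct item *filter_sorted(struct item *refs,
                                    const char **sought, int nr_sought)
  {
          struct item *out = NULL, **tail = &out;
          int i = 0;

          while (refs && i < nr_sought) {
                  int cmp = strcmp(refs->name, sought[i]);
                  if (cmp < 0)
                          refs = refs->next;      /* ref not asked for */
                  else if (cmp > 0)
                          i++;                    /* name not on remote */
                  else {
                          *tail = refs;           /* keep the match */
                          tail = &refs->next;
                          refs = refs->next;
                          i++;
                  }
          }
          *tail = NULL;
          return out;
  }

Each pointer only ever advances, so the walk is O(N + M) for N refs and
M sought names.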

The rails and git cases run in ~28s and ~37s, respectively, with most
of the time again going to the actual object transfer. So I think this
series removes all of the asymptotically bad behavior from this code
path.

One thing to note about all of these repos is that they tend to have
several refs pointing to a single commit. None of the speedups in this
series depends on that fact, but a repo with more independent refs
might uncover other slow code paths (e.g., I know that my fix for
mark_complete in ea5f220 improves the case with duplicate refs, but it
would not help if you really have 400K refs pointing to unique
commits[1]).

Martin, let me know if this improves things for your many-ref cases (and
if you are still seeing slowness in your repos with many refs, let me
know which operations cause it).

-Peff

[1] At the time of that commit, I proposed a fix that replaced the
    commit_list with a priority queue, but we deemed it too much code
    for a case that was unlikely to happen. It now sounds like that
    case is less unlikely than we thought, and with René's linked-list
    merge-sort code in place, I think we could fix it with a two-line
    change. I'll look into it.
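
    As a rough sketch of what that would look like (illustrative types
    here, not git's actual commit_list or the mergesort API): replacing
    date-ordered insertion with a one-shot linked-list merge sort turns
    the O(N^2) list build into O(N log N):

      struct elem { unsigned long date; struct elem *next; };

      static struct elem *merge(struct elem *a, struct elem *b)
      {
              struct elem head, *tail = &head;

              while (a && b) {
                      if (a->date >= b->date) {   /* newest first */
                              tail->next = a;
                              a = a->next;
                      } else {
                              tail->next = b;
                              b = b->next;
                      }
                      tail = tail->next;
              }
              tail->next = a ? a : b;
              return head.next;
      }

      static struct elem *sort_by_date(struct elem *list)
      {
              struct elem *slow, *fast, *half;

              if (!list || !list->next)
                      return list;
              slow = list;                        /* find the midpoint */
              fast = list->next;
              while (fast && fast->next) {
                      slow = slow->next;
                      fast = fast->next->next;
              }
              half = slow->next;
              slow->next = NULL;
              return merge(sort_by_date(list), sort_by_date(half));
      }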

