Re: Git push performance problems with ~100K refs

On Thu, Mar 29, 2012 at 08:43:06PM -0600, Martin Fick wrote:

> >It is trying to minimize the transfer cost.  By showing a ref to the
> >sending side, you prove you have chains of commits leading to that
> >commit and the sender knows that it does not have to send objects
> >that are reachable from that ref. One thing you could immediately do
> >is de-dup the 100k refs but we may already do that in the current
> >code.
> 
> I am sorry, I don't quite understand what you are suggesting is taking
> up the CPU time.  It doesn't take that much CPU just to gather 100K
> refs and send them to the other side; that would be I/O bound.  Could
> you explain what is happening on the receiving side that is so time
> consuming?

You said earlier that it is "git rev-list --objects --stdin --not --all"
taking up all the CPU. That is probably called by
check_everything_connected. And that is why it is slow when you push
even a small change, but fast when you push only a deletion (in the
latter case, we skip the check because there are no new objects).

As for why that rev-list is slow, my suspicion is that it may be
quadratic behavior in commit_list_insert_by_date as we process the set
of negative refs. Basically, we keep a priority queue of commits to be
processed in our graph walk, but the queue is stored as a linked list.
So insertion is O(n), and building a list of n items (especially if they
are not in sorted order) is O(n^2).
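
To make the cost argument concrete, here is a small Python sketch of the
same pattern. It is not git's code: Node, insert_by_date, build_list and
build_heap are hypothetical names made up for illustration, and the
binary heap is only a contrast case, not a claim about what git does or
should do. The list is kept newest-first, and each insert walks from the
head until it finds an older entry.

```python
import heapq


class Node:
    """One entry in a singly linked list ordered newest-first."""
    __slots__ = ("date", "next")

    def __init__(self, date, next=None):
        self.date = date
        self.next = next


def insert_by_date(head, date):
    """Insert date into the sorted list; return (new_head, nodes_walked)."""
    walked = 0
    prev, cur = None, head
    # Walk past every entry that is at least as new as the one being added.
    while cur is not None and cur.date >= date:
        prev, cur = cur, cur.next
        walked += 1
    node = Node(date, cur)
    if prev is None:
        return node, walked
    prev.next = node
    return head, walked


def build_list(dates):
    """Insert all dates one by one; return (head, total_nodes_walked)."""
    head, total = None, 0
    for d in dates:
        head, walked = insert_by_date(head, d)
        total += walked
    return head, total


def to_dates(head):
    """Collect the dates of a list into a Python list, head first."""
    out = []
    while head is not None:
        out.append(head.date)
        head = head.next
    return out


def build_heap(dates):
    """Contrast case: a binary heap, O(log n) per insert."""
    heap = []
    for d in dates:
        heapq.heappush(heap, -d)  # negate so the newest date pops first
    return heap


if __name__ == "__main__":
    # Worst case for this list: every new entry is older than all others,
    # so each insert walks the whole list, n*(n-1)/2 nodes in total.
    for n in (1000, 2000, 4000):
        _, steps = build_list(range(n, 0, -1))
        print(n, steps)
```

Doubling n roughly quadruples the number of nodes walked in the worst
case, which is the growth you would expect to hurt at 100K refs.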

I've run into this before dealing with repos with many refs (at GitHub,
some of our alternates repositories hit 100K refs, although typically we
have a lot of duplicated refs, as we are storing identical tags from
many repositories).

But that's just a suspicion. I don't have time tonight to work out a
test case. Is it possible for you to run something like:

  # make a new commit on top of HEAD, but not yet referenced
  sha1=$(git commit-tree 'HEAD^{tree}' -p HEAD </dev/null)

  # now do the same "connected" test that receive-pack would do
  git rev-list --objects "$sha1" --not --all

That should replicate the slow behavior you are seeing. If that works,
try running the latter command under "perf"; my guess is that you will
see commit_list_insert_by_date as a hot-spot.

Even doing this simple test on a moderate repository (my git.git has
~1100 refs), commit_list_insert_by_date accounts for 10% of the CPU
according to perf.

-Peff