On Wed, 16 Sep 2009, Junio C Hamano wrote:
> Julian Phillips <julian@xxxxxxxxxxxxxxxxx> writes:
> > I have a repository at $dayjob where fetch was taking ~30s to tell me
> > that there were no updates.
> >
> > It turns out that I appear to have added a nasty linear search of all
> > remote refs for every peeled (i.e. tag^{}) tag ref way back in the
> > original C implementation of fetch. This doesn't scale well to large
> > numbers of refs, so this replaces it with a hash table based lookup
> > instead, which brings the time down to a few seconds even for very
> > large ref counts.
> >
> > I haven't tested it with non-native transports, but there is no reason
> > to believe that the code should be transport specific.
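As a rough illustration of the fix described above — replacing the per-ref linear scan with a hash-table lookup — here is a minimal, self-contained sketch. The struct layout, `HASH_SIZE`, and the function bodies are simplified stand-ins, not the actual ref-dict patch:

```c
#include <string.h>

/* Simplified stand-ins for git's struct ref and the ref-dict
 * described above; an illustration, not the actual git source. */
#define HASH_SIZE 1024

struct ref {
	struct ref *next;   /* hash-bucket link */
	char name[64];
};

/* djb2-style string hash, reduced to a bucket index. */
static unsigned int hash_name(const char *name)
{
	unsigned int h = 5381;
	while (*name)
		h = (h << 5) + h + (unsigned char)*name++;
	return h % HASH_SIZE;
}

/* One O(n) insertion pass over the remote refs... */
static void ref_dict_add(struct ref **buckets, struct ref *r)
{
	unsigned int h = hash_name(r->name);
	r->next = buckets[h];
	buckets[h] = r;
}

/* ...then O(1) average lookup per refs/tags/X^{} entry, instead of
 * the O(n) scan of the whole ref list the old code effectively did. */
static struct ref *ref_dict_lookup(struct ref **buckets, const char *name)
{
	struct ref *r;
	for (r = buckets[hash_name(name)]; r; r = r->next)
		if (!strcmp(r->name, name))
			return r;
	return NULL;
}
```

With n refs and one peeled entry per tag, this turns the O(n^2) total into O(n), which matches the reported drop from ~30s to a few seconds.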
> Very interesting.
>
> A few questions (not criticisms).
>
>  * 1m50s to 4.5s is quite impressive, even if it is only in a repository
>    with an unusual refs-vs-commits ratio, but I personally think 10 refs
>    per commit is already on the borderline of being insane, and the
>    normal ratio would be more like 1 ref per 10-20 commits.
I noticed the problem with a real repository at $dayjob, and did enough
analysis to identify the problem. I deliberately created a repository
that emphasises the problem, so that it was easy to get a handle on.

Having applied the ref-dict patches, fetch on my $dayjob repo has gone
from ~30s to ~4.5s, which may not be as impressive, but is much easier
to live with.
>    What are possible downsides with the new code in repositories with a
>    more reasonable refs-vs-commits ratio? A hash table (with a sensible
>    hash function) would almost always outperform linear search in a
>    randomly ordered collection, so my gut tells me that there won't be
>    performance downsides, but are there other potential issues we
>    should worry about?
I guess the main thing would be the extra memory usage.
>  * In an insanely large refs-vs-commits case, perhaps not 50000:1 but
>    more like 100:1, but with a history with far more than one commit,
>    what is the memory consumption? Judging from a cursory view, I think
>    the way ref-dict re-uses struct ref might be quite suboptimal, as
>    you are using only next (for hash-bucket link), old_sha1[] and its
>    name field, and also your ref_dict_add() calls alloc_ref(), which
>    calls one calloc() per requested ref, instead of attempting any bulk
>    allocation.
Yeah, I just reused struct ref for speed and convenience of developing,
to verify that ref-dict would give me the speed I wanted. A final patch
would want a more optimised version. Except that I've thrown the whole
hash table thing away anyway.
>  * The outer loop is walking the list of refs from a transport, and the
>    inner loop is walking a copy of the same list of refs from the same
>    transport, looking up, for each refs/tags/X^{}, what record, if any,
>    existed for refs/tags/X.
>
>    Would it make sense to further specialize your optimization? For
>    example, something like...
I actually arrived at something similar to this myself, after realising
that I could use string_list as a basis.
>  * It is tempting to use a hash table when you have to deal with an
>    unordered collection, but in this case, wouldn't the refs obtained
>    from the transport (it's essentially ls-remote output, isn't it?) be
>    sorted? Can't you take advantage of that fact to optimize the loop,
>    without adding a specialized hash table implementation?
I wasn't sure if we could rely on the refs list being sorted. But I've
got a new version that uses an extra string_list instead, which is
actually slightly faster. I'll post that shortly.
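For context: git's string_list keeps a sorted array of strings and finds entries by binary search. This is a minimal stand-in using the standard qsort/bsearch (the function names here are ours, not git's), showing the O(log n) lookup per peeled tag that replaces the old O(n) scan:

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over an array of C strings, usable by qsort and bsearch. */
static int cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Binary search in a sorted array of ref names: O(log n) per lookup,
 * much like string_list_lookup() on a sorted string_list. */
static const char *sorted_lookup(const char **sorted, size_t n,
				 const char *name)
{
	const char **hit = bsearch(&name, sorted, n, sizeof(*sorted),
				   cmp_str);
	return hit ? *hit : NULL;
}
```

One extra insertion pass to build (or sort) the list, then each refs/tags/X^{} entry costs a single binary search rather than a walk over every remote ref.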
>    We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{}
>    in the ls-remote output. And the inefficient loop is about finding
>    refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember
>    the tag ref we saw in the previous round, we can check with that
>    first to make sure our "sorted" assumption holds true, and optimize
>    the loop out that way, no?
--
Julian
---
Bilbo's First Law:
You cannot count friends that are all packed up in barrels.