Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 24 May 2012 06:54:07 +0200

On 05/22/2012 07:35 PM, Junio C Hamano wrote:
> The current code reads the whole thing in upon first use of _any_ element
> in the file, just like the index codepath does for the index file.
>
> But the calling pattern to the refs machinery is fairly well isolated and
> all happens in refs.c file.  Especially thanks to the recent work by
> Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
> have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
> kind of callers, it is reasonably easy to design a better structured
> packed-refs file format to allow us to read only a subtree portion of
> refs/ hierarchy, and plug that logic into the lazy ref population code.
> Such a "design a better packed-refs format for scalability to 400k refs"
> is a very well isolated project that has high chance of succeeding without
> breaking things.  No code outside refs.c assumes that there is a flat
> array of refs that records what was read from the packed-refs file and can
> walk linearly over it, unlike the in-core index.

Even with the current file format, it would not be so difficult to 
bisect the file, synchronizing on record boundaries by looking for the 
next NL character.  Because of the way the file is sorted, it would also 
be reasonably efficient to read whole subtrees in one slurp (e.g., for 
for_each_ref() with a prefix argument).  Nontrivial modifications would 
of course not be possible without a rewrite.
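
To make the bisection idea concrete, here is a rough sketch, not
code from refs.c.  It assumes the file is already in memory (e.g.
mmap()ed), that every record has the form "<sha1> SP <refname> NL",
and that records are sorted bytewise by refname.  The "# pack-refs"
header and "^" peeled lines of the real format are ignored, and all
function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset just past the NL that terminates the record containing `start`. */
static size_t record_end(const char *buf, size_t len, size_t start)
{
	const char *nl = memchr(buf + start, '\n', len - start);
	return nl ? (size_t)(nl - buf) + 1 : len;
}

/* Synchronize on a record boundary: first record start at or after `pos`. */
static size_t next_record(const char *buf, size_t len, size_t pos)
{
	return pos == 0 ? 0 : record_end(buf, len, pos - 1);
}

/* Refname of the record starting at `start`; length goes to *name_len. */
static const char *record_name(const char *buf, size_t len, size_t start,
			       size_t *name_len)
{
	size_t end = record_end(buf, len, start);
	const char *sp = memchr(buf + start, ' ', end - start);
	const char *name;

	assert(sp != NULL);	/* assumption: records are well-formed */
	name = sp + 1;
	*name_len = (size_t)(buf + end - name);
	if (*name_len > 0 && name[*name_len - 1] == '\n')
		(*name_len)--;
	return name;
}

/* strcmp()-style comparison of a record's refname against `key`. */
static int compare_record(const char *buf, size_t len, size_t start,
			  const char *key)
{
	size_t name_len, key_len = strlen(key);
	const char *name = record_name(buf, len, start, &name_len);
	size_t n = name_len < key_len ? name_len : key_len;
	int cmp = memcmp(name, key, n);

	return cmp ? cmp : (name_len > key_len) - (name_len < key_len);
}

/* Offset of the first record whose refname is >= key (len if none). */
size_t packed_refs_lower_bound(const char *buf, size_t len, const char *key)
{
	size_t lo = 0, hi = len;	/* lo is always a record boundary */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		size_t start = next_record(buf, len, mid);

		if (start >= hi)
			hi = mid;	/* no record starts in [mid, hi) */
		else if (compare_record(buf, len, start, key) < 0)
			lo = record_end(buf, len, start);
		else
			hi = start;
	}
	return lo;
}

/* Single-reference lookup: pointer to the matching record, or NULL. */
const char *packed_ref_lookup(const char *buf, size_t len, const char *refname)
{
	size_t pos = packed_refs_lower_bound(buf, len, refname);

	if (pos < len && compare_record(buf, len, pos, refname) == 0)
		return buf + pos;
	return NULL;
}

/* Byte range [*begin, *end) of all records whose refname starts with prefix. */
void packed_refs_subtree(const char *buf, size_t len, const char *prefix,
			 size_t *begin, size_t *end)
{
	size_t plen = strlen(prefix);
	size_t pos = packed_refs_lower_bound(buf, len, prefix);

	*begin = pos;
	while (pos < len) {
		size_t name_len;
		const char *name = record_name(buf, len, pos, &name_len);

		if (name_len < plen || memcmp(name, prefix, plen))
			break;
		pos = record_end(buf, len, pos);
	}
	*end = pos;
}
```

Because records with a common prefix are contiguous in a sorted file,
packed_refs_subtree() only has to bisect once and then walk forward,
which is the "one slurp" case for something like for_each_ref() with
a prefix such as "refs/heads/".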

There would need to be some intelligence built-in; after enough 
single-reference accesses come in a row, then the refs module should 
probably take it upon itself to read the whole packed-refs file to speed 
up further lookups.
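
A minimal sketch of that heuristic might look like the following.
The struct, the function names and the threshold of 16 are all
invented for illustration; nothing like this exists in refs.c as far
as this message is concerned, and the right cutoff would have to be
measured.

```c
#include <stdbool.h>

/* Arbitrary cutoff, purely an assumption for this sketch. */
#define FULL_READ_THRESHOLD 16

struct packed_refs_state {
	unsigned single_lookups_in_a_row;
	bool fully_loaded;	/* whole file parsed into memory? */
};

/*
 * Call before each single-reference lookup.  Returns true if the
 * caller should bisect the file for this lookup, false if it should
 * use the fully parsed in-memory copy (loading it first if this call
 * is the one that crossed the threshold).
 */
bool should_bisect(struct packed_refs_state *st)
{
	if (st->fully_loaded)
		return false;
	if (++st->single_lookups_in_a_row >= FULL_READ_THRESHOLD) {
		st->fully_loaded = true;	/* caller reads the whole file now */
		return false;
	}
	return true;
}

/* Call when some other kind of access breaks the run of single lookups. */
void note_other_access(struct packed_refs_state *st)
{
	st->single_lookups_in_a_row = 0;
}
```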

> If you do "for_each_ref()" for everything (e.g. sending 'have' during the
> object transfer, or repacking the whole repository), you would end up
> needing to read the whole thing for obvious reasons.

Yes.  ISTM that any hope to avoid O(number of refs) problems when 
exchanging commits must involve using more intelligence about how 
references are related to each other topologically to improve the 
negotiation about what needs to be transferred.

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/