Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 22 May 2012 10:35:50 -0700

Jeff King <peff@xxxxxxxx> writes:

> On Tue, May 22, 2012 at 07:18:00PM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> On Tue, May 22, 2012 at 12:45 AM, Jeff King <peff@xxxxxxxx> wrote:
>> > The rails/rails network repository at GitHub (i.e., a master repo with
>> > all of the objects and refs for all of the forks) has about 400K refs,
>> > and has been the usual impetus for me finding and fixing these sorts of
>> > quadratic problems.
>> 
>> Off topic and pure speculation. With 400k refs, each one 20 byte in
>> length, the pathname part only can take 7MB. Perhaps packed-refs
>> should learn prefix compressing too, like index v4, to reduce size
>> (and hopefully improve startup speed). Compressing refs/heads/ and
>> refs/tags/ only could gain quite a bit already.
>
> In this case, the packed-refs file is 30MB. Even just gzipping it takes
> it down to 2MB. As far as I know, we don't ever do random access on the
> file, but instead just stream it into memory.

True.

The current code reads the whole thing in upon first use of _any_ element
in the file, just like the index codepath does for the index file.

But the calling pattern to the refs machinery is fairly well isolated and
all happens in refs.c file.  Especially thanks to the recent work by
Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
kind of callers, it is reasonably easy to design a better structured
packed-refs file format to allow us to read only a subtree portion of
refs/ hierarchy, and plug that logic into the lazy ref population code.
Such a "design a better packed-refs format for scalability to 400k refs"
is a very well isolated project that has high chance of succeeding without
breaking things.  No code outside refs.c assumes that there is a flat
array of refs that records what was read from the packed-refs file and can
walk linearly over it, unlike the in-core index.

If you do "for_each_ref()" for everything (e.g. sending 'have' during the
object transfer, or repacking the whole repository), you would end up
needing to read the whole thing for obvious reasons.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html