Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

On Tue, May 22, 2012 at 10:35:50AM -0700, Junio C Hamano wrote:

> > In this case, the packed-refs file is 30MB. Even just gzipping it takes
> > it down to 2MB. As far as I know, we don't ever do random access on the
> > file, but instead just stream it into memory.
> 
> True.
> 
> The current code reads the whole thing in upon first use of _any_ element
> in the file, just like the index codepath does for the index file.
> 
> But the calling pattern to the refs machinery is fairly well isolated and
> all happens in the refs.c file.  Thanks especially to the recent work by
> Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
> have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
> kind of callers, it is reasonably easy to design a better structured
> packed-refs file format to allow us to read only a subtree portion of
> refs/ hierarchy, and plug that logic into the lazy ref population code.
> Such a "design a better packed-refs format for scalability to 400k refs"
> is a very well isolated project that has high chance of succeeding without
> breaking things.  No code outside refs.c assumes that there is a flat
> array of refs that records what was read from the packed-refs file and can
> walk linearly over it, unlike the in-core index.
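
(For what it's worth, once the packed refs are held sorted in core, that
kind of "frotz, or anything under frotz/" question is cheap to answer
without looking at the rest of the list. A minimal sketch with made-up
names, not the actual refs.c code:

#include <stdio.h>
#include <string.h>

/*
 * refs[] is a sorted array of fully-qualified refnames, as a sorted
 * packed-refs file would give us. Return the index of the first entry
 * that sorts at or after "key".
 */
static int first_at_or_after(const char **refs, int nr, const char *key)
{
        int lo = 0, hi = nr;

        while (lo < hi) {
                int mid = lo + (hi - lo) / 2;
                if (strcmp(refs[mid], key) < 0)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
}

/*
 * "Do we have 'name', or anything under 'name/'?" -- two binary
 * searches, no walk over the whole list.
 */
static int name_or_subtree_exists(const char **refs, int nr, const char *name)
{
        char prefix[1024];
        int pos = first_at_or_after(refs, nr, name);

        if (pos < nr && !strcmp(refs[pos], name))
                return 1;
        snprintf(prefix, sizeof(prefix), "%s/", name);
        pos = first_at_or_after(refs, nr, prefix);
        return pos < nr && !strncmp(refs[pos], prefix, strlen(prefix));
}

int main(void)
{
        const char *refs[] = {
                "refs/heads/frotz/topic", "refs/heads/master", "refs/tags/v1.0"
        };
        printf("%d\n", name_or_subtree_exists(refs, 3, "refs/heads/frotz"));
        return 0;
}

The point being that a format which keeps the names sorted, or indexed by
subtree, never has to materialize anything outside the prefix being asked
about.)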

Yeah. As gross as I find a 30MB packed-refs file, I don't actually think
that is the bottleneck at all. Sure, it wastes some disk cache, but in
the grand scheme of things, anybody with 400K refs probably has a much,
much bigger object database.

So there probably isn't much point in clever tricks to make it smaller
on disk, especially if they will make further partial-read optimizations
harder to do. But...

> If you do "for_each_ref()" for everything (e.g. sending 'have' during the
> object transfer, or repacking the whole repository), you would end up
> needing to read the whole thing for obvious reasons.
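
(To make that calling pattern concrete: a "give me everything" caller goes
through the callback interface, roughly like the following. This is a
sketch against the sha1-era signature from memory, not a verbatim caller:

#include "cache.h"
#include "refs.h"

/*
 * Called once per ref. for_each_ref() has no way of knowing the caller
 * wants every ref until it has walked all of them, so the whole ref
 * store ends up being read.
 */
static int count_one(const char *refname, const unsigned char *sha1,
                     int flags, void *cb_data)
{
        unsigned long *count = cb_data;
        (*count)++;
        return 0;       /* a non-zero return would stop the iteration */
}

static unsigned long count_all_refs(void)
{
        unsigned long count = 0;
        for_each_ref(count_one, &count);
        return count;
}

Any lazy or subtree-at-a-time loading scheme still degrades to a full read
for callers like this.)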

I think Michael's work is really great for the loose refs, where hitting
each directory is very expensive. But for a repo with packed refs that
is just going to call for_each_ref, I'm worried that breaking apart the
ref list is going to cause noticeable overhead. Still, the regression
I'm seeing seems way too big for that overhead, so I think something
else is going on; we just need to find it.
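
For anyone joining the thread via the subject line: the remove_duplicates()
in question is the usual nested-strcmp dedup over the heads array. Sketched
from memory rather than quoted verbatim from builtin/fetch-pack.c, it is
shaped roughly like this:

#include <string.h>

/*
 * Keep the first occurrence of each name and drop later duplicates.
 * Every entry is compared against everything already kept, so with
 * mostly-unique input this is O(N^2) strcmp calls -- painful at 400K
 * refs. Sorting the array first (or hashing the names) would bring the
 * duplicate check down to O(N log N) or O(N).
 */
static int remove_duplicates(int nr_heads, char **heads)
{
        int src, dst;

        for (src = dst = 0; src < nr_heads; src++) {
                int i;

                for (i = 0; i < dst; i++)
                        if (!strcmp(heads[i], heads[src]))
                                break;
                if (i < dst)
                        continue;       /* already seen */
                heads[dst++] = heads[src];
        }
        return dst;     /* the deduplicated number of heads */
}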

-Peff

