On Tue, May 22, 2012 at 10:35:50AM -0700, Junio C Hamano wrote:

> > In this case, the packed-refs file is 30MB. Even just gzipping it
> > takes it down to 2MB. As far as I know, we don't ever do random
> > access on the file, but instead just stream it into memory.
>
> True.
>
> The current code reads the whole thing in upon first use of _any_
> element in the file, just like the index codepath does for the index
> file.
>
> But the calling pattern to the refs machinery is fairly well isolated
> and all happens in the refs.c file. Especially thanks to the recent
> work by Michael Haggerty, for "I am about to create a new branch
> 'frotz'; do I have 'refs/heads/frotz' or anything that begins with
> 'refs/heads/frotz/'?" kind of callers, it is reasonably easy to
> design a better structured packed-refs file format that allows us to
> read only a subtree portion of the refs/ hierarchy, and plug that
> logic into the lazy ref population code. Such a "design a better
> packed-refs format for scalability to 400k refs" project is very well
> isolated and has a high chance of succeeding without breaking things.
> No code outside refs.c assumes that there is a flat array of refs
> that records what was read from the packed-refs file and can walk
> linearly over it, unlike the in-core index.

Yeah. As gross as I find a 30MB packed-refs file, I don't actually
think it is the bottleneck at all. Sure, it wastes some disk cache,
but in the grand scheme of things, anybody with 400k refs probably has
a much, much bigger object database. So there probably isn't much
point in clever tricks to make the file smaller on disk, especially if
they would make further partial-read optimizations harder to do.

But...

> If you do "for_each_ref()" for everything (e.g., sending 'have' lines
> during the object transfer, or repacking the whole repository), you
> would end up needing to read the whole thing for obvious reasons.

I think Michael's work is really great for the loose refs, where
hitting each directory is very expensive. But for a repo with packed
refs that is just going to call for_each_ref, I'm worried that
breaking apart the ref list will cause noticeable overhead.

Still, the regression I'm seeing seems way too big for that overhead,
so I think something else is going on. We just need to find it.

-Peff
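
P.S. To make the "read only a subtree" idea concrete, here is a toy
sketch of the kind of prefix-scoped lookup a sorted ref list permits.
Nothing below is actual git code; the table and helper names are made
up for illustration. Since packed-refs entries are sorted, all the
refs under a given prefix form one contiguous run, so a binary search
plus a short forward scan answers the 'frotz' question without looking
at the rest of the 400k entries.

  #include <stdio.h>
  #include <string.h>

  /* stand-in for the in-core packed-refs list; sorted by refname */
  static const char *refs[] = {
          "refs/heads/frotz",
          "refs/heads/frotz/nitfol",
          "refs/heads/master",
          "refs/tags/v1.0",
  };
  static const int nr_refs = sizeof(refs) / sizeof(*refs);

  /* index of the first entry >= prefix (classic binary search) */
  static int lower_bound(const char *prefix)
  {
          int lo = 0, hi = nr_refs;
          while (lo < hi) {
                  int mid = lo + (hi - lo) / 2;
                  if (strcmp(refs[mid], prefix) < 0)
                          lo = mid + 1;
                  else
                          hi = mid;
          }
          return lo;
  }

  /* is any ref name equal to, or starting with, this prefix? */
  static int have_ref_with_prefix(const char *prefix)
  {
          size_t len = strlen(prefix);
          int i = lower_bound(prefix);
          /* sorted list: matches, if any, start right here */
          return i < nr_refs && !strncmp(refs[i], prefix, len);
  }

  int main(void)
  {
          /* "anything under refs/heads/frotz/?" touches only the
           * contiguous matching run, never the whole list */
          printf("%d\n", have_ref_with_prefix("refs/heads/frotz/"));
          return 0;
  }

A subtree-aware on-disk format would just let you page in that
contiguous run instead of having it all in core first.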
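
P.P.S. And for context on the for_each_ref() side: the calling
convention is roughly the one below (going from memory of refs.h; the
tiny backend here is a stand-in so the snippet compiles on its own).
The point is that these callers hand a callback that wants every ref,
so no amount of subtree structure on disk saves them from reading the
whole list.

  #include <stdio.h>

  /* roughly the each_ref_fn shape from refs.h of this era */
  typedef int (*each_ref_fn)(const char *refname,
                             const unsigned char *sha1,
                             int flags, void *cb_data);

  /* stand-in backend: a real for_each_ref must visit every ref,
   * so it has to load the entire packed-refs list regardless of
   * how cleverly the on-disk format is split into subtrees */
  static int for_each_ref(each_ref_fn fn, void *cb_data)
  {
          static const char *all_refs[] = {
                  "refs/heads/master",
                  "refs/tags/v1.0",
          };
          static const unsigned char null_sha1[20];
          int i, ret;

          for (i = 0; i < (int)(sizeof(all_refs) / sizeof(*all_refs)); i++) {
                  ret = fn(all_refs[i], null_sha1, 0, cb_data);
                  if (ret)
                          return ret; /* non-zero aborts the walk */
          }
          return 0;
  }

  /* e.g., upload-pack would advertise each sha1; we just count */
  static int count_one(const char *refname, const unsigned char *sha1,
                       int flags, void *cb_data)
  {
          int *count = cb_data;
          (*count)++;
          return 0;
  }

  int main(void)
  {
          int count = 0;
          for_each_ref(count_one, &count);
          printf("%d refs\n", count);
          return 0;
  }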