On Tue, 11 Dec 2007, Jon Smirl wrote:
> >
> > So if you want to use more threads, that _forces_ you to have a bigger
> > memory footprint, simply because you have more "live" objects that you
> > work on. Normally, that isn't much of a problem, since most source files
> > are small, but if you have a few deep delta chains on big files, both the
> > delta chain itself is going to use memory (you may have limited the size
> > of the cache, but it's still needed for the actual delta generation, so
> > it's not like the memory usage went away).
>
> This makes sense. Those runs that blew up to 4.5GB were a combination
> of this effect and fragmentation in the gcc allocator. Google
> allocator appears to be much better at controlling fragmentation.

Yes. I think we do have some cases where we simply keep a lot of objects
around, and if we are talking reasonably large deltas, we'll have the
whole delta chain in memory just to unpack one single object.

The delta cache size limits kick in only when we explicitly cache old
delta results (in case they will be re-used, which is rather common);
they don't affect the normal "I'm using this data right now" case at all.

And then fragmentation makes it much, much worse. Since the allocation
patterns aren't nice (they are pretty random and depend on just the sizes
of the objects), and the lifetimes aren't always nicely nested _either_
(they become more so when you disable the cache entirely, but that's just
death for performance), I'm not surprised that there can be memory
allocators that end up having some issues.

> Is there a reasonable scheme to force the chains to only be loaded
> once and then shared between worker threads? The memory blow-up
> appears to be directly correlated with chain length.
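The delta-chain cost described above - needing every base in the chain alive just to unpack one object - can be sketched roughly like this. This is an illustrative toy, not git's actual code or delta format: the `store` layout, the instruction tuples, and the function names are all invented for the example.

```python
# Toy sketch of resolving an object stored at the end of a delta chain.
# To materialize one object, every base below it in the chain must be
# materialized first, so peak memory grows with chain depth.

def apply_delta(base: bytes, delta) -> bytes:
    # Toy delta: a list of ("copy", offset, size) or ("insert", data)
    # instructions -- a stand-in for git's real binary delta encoding.
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, off, size = op
            out += base[off:off + size]
        else:
            out += op[1]
    return bytes(out)

def unpack(obj_id, store) -> bytes:
    # store maps obj_id -> ("full", data) or ("delta", base_id, delta)
    entry = store[obj_id]
    if entry[0] == "full":
        return entry[1]
    _, base_id, delta = entry
    base = unpack(base_id, store)  # the whole chain stays live here
    return apply_delta(base, delta)
```

A delta cache short-circuits the recursion for bases that are hit repeatedly, but - as noted above - the bases needed for the unpack currently in progress have to be in memory regardless of any cache limit.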
The worker threads explicitly avoid touching the same objects, and no,
you definitely don't want to explode the chains globally once, because
the whole point is that we do fit 15 years' worth of history into 300MB
of pack-file thanks to having a very dense representation.

The "loaded once" part is the mmap'ing of the pack-file into memory, but
if you were to actually then try to expand the chains, you'd be talking
about many *many* more gigabytes of memory than you already see used ;)

So what you actually want to do is to just re-use already-packed delta
chains directly, which is what we normally do. But you are explicitly
looking at the "--no-reuse-delta" (aka "git repack -f") case, which is
why it then blows up.

I'm sure we can find places to improve. But I would like to re-iterate
the statement that you're kind of doing a "don't do that then" case,
which is really - by design - meant to be done once and never again, and
is using resources - again, pretty much by design - wildly
inappropriately just to get an initial packing done.

> That may account for the threaded version needing an extra 20 minutes
> of CPU time. An extra 12% of CPU seems like too much overhead for
> threading. Just letting a couple of those long chain compressions be
> done twice

Well, Nico pointed out that those things should all be thread-private
data, so no, the race isn't there (unless there's some other bug there).

> I agree, this problem only occurs when people import giant
> repositories. But every time someone hits these problems they declare
> git to be screwed up and proceed to trash it in their blogs.

Sure. I'd love to do global packing without paying the cost, but it
really was a design decision. Thanks to doing off-line packing ("let it
run overnight on some beefy machine") we can get better results. It's
expensive, yes. But it was pretty much meant to be expensive.
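The reuse-vs-recompute decision above can be sketched as follows. This is an assumed simplification, not git's actual repack code: the `pack_object` / `compute_delta` names and the dict layout are invented, and the real delta search compares objects against candidate bases rather than just compressing.

```python
import zlib

def compute_delta(data: bytes) -> bytes:
    # Stand-in for git's real (expensive) delta search; here we just
    # deflate the raw data so the sketch does something concrete.
    return zlib.compress(data)

def pack_object(obj: dict, reuse_delta: bool = True) -> bytes:
    """Return the bytes to write into the new pack for one object."""
    if reuse_delta and obj.get("packed") is not None:
        # Normal repack: copy the already-deltified, already-compressed
        # bytes straight out of the old (mmap'ed) pack -- no inflation,
        # no delta search, no large allocations.
        return obj["packed"]
    # "git repack -f" (--no-reuse-delta): throw the old deltas away and
    # redo the expensive work from the fully expanded object data.
    return compute_delta(obj["data"])
```

The fast path is why a routine repack is cheap: the dense representation flows from the old pack to the new one untouched, and only `-f` forces the "done once, overnight, on a beefy machine" cost again.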
It's a very efficient compression algorithm, after all, and you're
turning it up to eleven ;)

I also suspect that the gcc archive makes things more interesting thanks
to having some rather large files. The ChangeLog is probably the worst
case (large file with *lots* of edits), but I suspect the *.po files
aren't wonderful either.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html