On Tue, 11 Dec 2007, Jon Smirl wrote:
> >
> > So if you want to use more threads, that _forces_ you to have a bigger
> > memory footprint, simply because you have more "live" objects that you
> > work on. Normally, that isn't much of a problem, since most source files
> > are small, but if you have a few deep delta chains on big files, both the
> > delta chain itself is going to use memory (you may have limited the size
> > of the cache, but it's still needed for the actual delta generation, so
> > it's not like the memory usage went away).
>
> This makes sense. Those runs that blew up to 4.5GB were a combination
> of this effect and fragmentation in the gcc allocator. Google
> allocator appears to be much better at controlling fragmentation.

Yes. I think we do have some cases where we simply keep a lot of objects
around, and if we are talking reasonably large deltas, we'll have the
whole delta chain in memory just to unpack one single object.

The delta cache size limits kick in only when we explicitly cache old
delta results (in case they will be re-used, which is rather common);
they don't affect the normal "I'm using this data right now" case at all.

And then fragmentation makes it much, much worse. Since the allocation
patterns aren't nice (they are pretty random and depend on just the sizes
of the objects), and the lifetimes aren't always nicely nested _either_
(they become more so when you disable the cache entirely, but that's just
death for performance), I'm not surprised that there can be memory
allocators that end up having some issues.

> Is there a reasonable scheme to force the chains to only be loaded
> once and then shared between worker threads? The memory blow-up
> appears to be directly correlated with chain length.
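The delta-chain cost described above - needing every base in the chain alive just to unpack one object - can be sketched roughly like this. This is an illustrative toy, not git's actual code or delta format: the `store` layout, the instruction tuples, and the function names are all invented for the example.

```python
# Toy sketch of resolving an object stored at the end of a delta chain.
# To materialize one object, every base below it in the chain must be
# materialized first, so peak memory grows with chain depth.

def apply_delta(base: bytes, delta) -> bytes:
    # Toy delta: a list of ("copy", offset, size) or ("insert", data)
    # instructions -- a stand-in for git's real binary delta encoding.
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, off, size = op
            out += base[off:off + size]
        else:
            out += op[1]
    return bytes(out)

def unpack(obj_id, store) -> bytes:
    # store maps obj_id -> ("full", data) or ("delta", base_id, delta)
    entry = store[obj_id]
    if entry[0] == "full":
        return entry[1]
    _, base_id, delta = entry
    base = unpack(base_id, store)  # the whole chain stays live here
    return apply_delta(base, delta)
```

A delta cache short-circuits the recursion for bases that are hit repeatedly, but - as noted above - the bases needed for the unpack currently in progress have to be in memory regardless of any cache limit.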
The worker threads explicitly avoid touching the same objects, and no,
you definitely don't want to explode the chains globally once, because
the whole point is that we do fit 15 years' worth of history into 300MB
of pack-file thanks to having a very dense representation.

The "loaded once" part is the mmap'ing of the pack-file into memory, but
if you were to actually then try to expand the chains, you'd be talking
about many *many* more gigabytes of memory than you already see used ;)

So what you actually want to do is to just re-use already-packed delta
chains directly, which is what we normally do. But you are explicitly
looking at the "--no-reuse-delta" (aka "git repack -f") case, which is
why it then blows up.

I'm sure we can find places to improve. But I would like to re-iterate
the statement that you're kind of doing a "don't do that then" case,
which is really - by design - meant to be done once and never again, and
is using resources - again, pretty much by design - wildly
inappropriately just to get an initial packing done.

> That may account for the threaded version needing an extra 20 minutes
> of CPU time. An extra 12% of CPU seems like too much overhead for
> threading. Just letting a couple of those long chain compressions be
> done twice

Well, Nico pointed out that those things should all be thread-private
data, so no, the race isn't there (unless there's some other bug there).

> I agree, this problem only occurs when people import giant
> repositories. But every time someone hits these problems they declare
> git to be screwed up and proceed to trash it in their blogs.

Sure. I'd love to do global packing without paying the cost, but it
really was a design decision. Thanks to doing off-line packing ("let it
run overnight on some beefy machine") we can get better results. It's
expensive, yes. But it was pretty much meant to be expensive.
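The reuse-vs-recompute decision above can be sketched as follows. This is an assumed simplification, not git's actual repack code: the `pack_object` / `compute_delta` names and the dict layout are invented, and the real delta search compares objects against candidate bases rather than just compressing.

```python
import zlib

def compute_delta(data: bytes) -> bytes:
    # Stand-in for git's real (expensive) delta search; here we just
    # deflate the raw data so the sketch does something concrete.
    return zlib.compress(data)

def pack_object(obj: dict, reuse_delta: bool = True) -> bytes:
    """Return the bytes to write into the new pack for one object."""
    if reuse_delta and obj.get("packed") is not None:
        # Normal repack: copy the already-deltified, already-compressed
        # bytes straight out of the old (mmap'ed) pack -- no inflation,
        # no delta search, no large allocations.
        return obj["packed"]
    # "git repack -f" (--no-reuse-delta): throw the old deltas away and
    # redo the expensive work from the fully expanded object data.
    return compute_delta(obj["data"])
```

The fast path is why a routine repack is cheap: the dense representation flows from the old pack to the new one untouched, and only `-f` forces the "done once, overnight, on a beefy machine" cost again.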
It's a very efficient compression algorithm, after all, and you're
turning it up to eleven ;)

I also suspect that the gcc archive makes things more interesting thanks
to having some rather large files. The ChangeLog is probably the worst
case (large file with *lots* of edits), but I suspect the *.po files
aren't wonderful either.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html