On Tue, 11 Dec 2007, Jon Smirl wrote:

> This makes sense. Those runs that blew up to 4.5GB were a combination
> of this effect and fragmentation in the gcc allocator.

I disagree. This is insane.

> Google allocator appears to be much better at controlling
> fragmentation.

Indeed. And if fragmentation is indeed wasting half of Git's memory
usage then we'll have to come up with a custom memory allocator.

> Is there a reasonable scheme to force the chains to only be loaded
> once and then shared between worker threads? The memory blow up
> appears to be directly correlated with chain length.

No. That would be the equivalent of holding each revision of all files
uncompressed all at once in memory.

> > That said, I suspect there are a few things fighting you:
> >
> >  - threading is hard. I haven't looked a lot at the changes Nico did
> >    to do a threaded object packer, but what I've seen does not
> >    convince me it is correct. The "trg_entry" accesses are *mostly*
> >    protected with "cache_lock", but nothing else really seems to be,
> >    so quite frankly, I wouldn't trust the threaded version very much.
> >    It's off by default, and for a good reason, I think.
> >
> >    For example: the packing code does this:
> >
> >        if (!src->data) {
> >                read_lock();
> >                src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz);
> >                read_unlock();
> >                ...
> >
> >    and that's racy. If two threads come in at roughly the same time
> >    and see a NULL src->data, they'll both get the lock, and they'll
> >    both (serially) try to fill it in. It will all *work*, but one of
> >    them will have done unnecessary work, and one of them will have
> >    their result thrown away and leaked.
>
> That may account for the threaded version needing an extra 20 minutes
> CPU time. An extra 12% of CPU seems like too much overhead for
> threading. Just letting a couple of those long chain compressions be
> done twice

No, it may not. This theory is wrong, as explained before.

> > Are you hitting issues like this? I dunno. The object sorting means
> > that different threads normally shouldn't look at the same objects
> > (not even the sources), so probably not, but basically, I wouldn't
> > trust the threading 100%. It needs work, and it needs to stay off by
> > default.
> >
> >  - you're working on a problem that isn't really even worth
> >    optimizing that much. The *normal* case is to re-use old deltas,
> >    which makes all of the issues you are fighting basically go away
> >    (because you only have a few _incremental_ objects that need
> >    deltaing).
>
> I agree, this problem only occurs when people import giant
> repositories. But every time someone hits these problems they declare
> git to be screwed up and proceed to trash it in their blogs.

It's not only for repack. Someone just reported git-blame being
unusable too due to insane memory usage, which I suspect is due to the
same issue.


Nicolas
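
[Illustration added in editing, not part of the original message: the
duplicated work and the leaked buffer Linus describes can be avoided
with a double-checked load, that is, re-testing the pointer after the
lock has been taken. The sketch below is self-contained C using a plain
pthread mutex and hypothetical stand-ins (struct entry, load_object(),
get_data()) for the real pack-objects structures and the
read_lock()/read_unlock() helpers; it shows the pattern only, and is
not claimed to be the fix that went into git.]

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for the real pack-objects entry. */
    struct entry {
            void *data;             /* lazily loaded object contents */
    };

    /* Plays the role of read_lock()/read_unlock() in the real code. */
    static pthread_mutex_t read_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for read_sha1_file(): pretend to load the object data. */
    static void *load_object(struct entry *e)
    {
            (void)e;
            return calloc(1, 4096);
    }

    /*
     * Double-checked load: re-test e->data after taking the lock, so a
     * thread that lost the race neither repeats the expensive load nor
     * leaks the duplicate buffer it would otherwise have allocated.
     */
    static void *get_data(struct entry *e)
    {
            if (!e->data) {
                    pthread_mutex_lock(&read_mutex);
                    if (!e->data)
                            e->data = load_object(e);
                    pthread_mutex_unlock(&read_mutex);
            }
            return e->data;
    }

    int main(void)
    {
            struct entry e = { NULL };
            get_data(&e);           /* first caller loads the data */
            get_data(&e);           /* second caller reuses it */
            free(e.data);
            return 0;
    }

Like the snippet quoted above, the first test still happens outside the
lock; a stricter variant would perform even that first check under the
lock (or with an atomic load), trading a little extra contention for
strict correctness.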