git-index-pack really does suck..

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio, Nico,
 I think we need to do something about it.

CLee was complaining about git-index-pack on #irc with the partial KDE 
repo, and while I don't have the KDE repo, I decided to investigate a bit.

Even with just the kernel repo (with a single 170MB pack-file), I can do

	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack

and it uses 52s of CPU-time, and on my 4GB machine it actually started 
doing IO and swapping, because git-index-pack grew to 4.8GB in size. So 
while I initially thought I'd want a bigger test-case to see the problem, 
I sure as heck don't.

The 52s of CPU time exploded into almost three minutes of actual 
real-time:

	47.33user 5.79system 2:41.65elapsed 32%CPU
	2117major+1245763minor

And that's on a good system with a powerful CPU, "enough memory" for any 
reasonable development, and good disks! Very much ungood-plus-plus.

I haven't looked into exactly why yet, but I bet it's just that we keep 
every single object expanded in memory. We do need to keep the objects 
around, so that we can resolve delta's, but we can certainly do it other 
ways. 

Two suggestion for other ways:

 - simple one: don't keep unexploded objects around, just keep the deltas, 
   and spend tons of CPU-time just re-expanding them if required.

   We *should* be able to do it with just keeping the original 170MB 
   pack-file in memory, not expanding it to 3.8GB! 

   Still, even this will be painful once you have a big pack-file, and the 
   CPU waste is nasty (although a delta-base cache like we do in 
   sha1_file.c would probably fix it 99% - at that point it's getting 
   less simple, and the "best" solution below looks more palatable)

 - best one: when writing out the pack-file, we incrementally keep a 
   "struct packed_git" around, and update the index for it dynamically, 
   and totally get rid of all objects that we've written out, because we 
   can re-create them.

   This means that we should have _zero_ memory footprint except for the 
   one object that we're working on right then and there, and any 
   unresolved deltas where we've not seen the base at all (and the latter 
   generally shouldn't happen any more with most pack-files)

The "best one" wouldn't seem to be *that* painful, but as mentioned, I 
haven't even started looking at the code yet, I thought I'd try to rope 
Nico into looking at this first ;)

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]