Re: unable to run gc (or git repack -Adl )

Nicolas Pitre <nico@xxxxxxxxxxx> · Fri, 29 Jan 2010 21:14:47 -0500 (EST)

On Fri, 29 Jan 2010, Jon Nelson wrote:

> Using 1.6.4.2 on openSUSE 11.2 (x86_64).
> 
> I have a beefy repo (du of 14GB) that I can't seem to run 'gc' on.
> 
> After running for over 2 hours, this is what I get:
> 
> Counting objects: 267676, done.
> Compressing objects: 100% (217424/217424), done.
> fatal: Unable to create temporary file: Too many open files
> error: failed to run repack

Ouch!!  Impressive.

> Ugh!

Indeed.

> I have 3 GB of memory (and 1GB of swap).
> When I strace the various processes, I see some things I don't understand:
> 
> 1. I see the 'git-repack' shell process scanning for .keep files. I
> don't have any. Is there a shortcut to this?
> 
> It's also hugely inefficient. In this case, the code to identify non
> .keep packs takes *4 minutes, 45 seconds*, lots of disk I/O, and lots
> of CPU (it pegs one CPU at 100% for the entire duration). With a wee
> bit of awk, I have reduced that to 2.3 seconds with VASTLY reduced I/O
> and CPU requirements. Patch attached.

Your patch will pick any .pack file in the repo not only from the 
.git/objects/pack directory.  There is no such thing as *.pack.keep 
either.

> 2. When git-pack-objects is being run, around the time it's 85% done
> "compressing" it's very very very slow. Like, 2-5 objects every
> second. The largest object in the repo is about 1MB.

You probably consumed all RAM and started swapping at that point.

Or... you have many of those 1MB objects.  If so try 
using --window-memory=8M or similar.

> 3. When git pack objects is running and counting up the number of
> objects, it is stat'ing files that aren't in the working directly, and
> should not be, according to the index. If I switch the repo to be a
> "bare" repository, then it doesn't do that, however, why is it doing
> that in the first place?

A bare repository has no index.  When the index is present though, it is 
necessary to also pack objects it references.  Why working directory 
files would be stat()'d in that case I don't know.

> 4. Should git-pack-objects be reading the pack.idx files for counting
> objects instead of the .pack files themselves?

No.  The whole point when "counting objects" is to perform a walk of the 
history graph and capture the set of objects that are actually 
referenced from your branches/tags and leave the unreferenced objects 
behind.  Also the order in which those objects are encountered during 
that history walk is very important for efficient object placement in 
the final pack.  So this is much more involved than only listing the 
objects contained in every packs.

> 5. There is no 5

I'm a flying buldozer.

> 6. Should git-pack-objects be closing .pack files after opening them?
> I have 6559 .pack files.

No wonder why you exhausted your file handles.  And your repository must 
be _horribly_ slow to work with, which might explain the 
slowness/swappiness.

> 7. Ultimately, how do I get "git gc" to work on this repo?

... because you really really want to repack this mess ASAP of course.

Having so many packs means they must be relatively small.  Yet, Git 
allows up to 8GB of pack data to be mmap()'d at once on x86_64.  This 
means that an average of 3700 packs might be mapped at once, plus their 
respective .idx files.

You could try:

	git config core.packedGitLimit 256m
	git config core.packedGitWindowSize 32m
	git config pack.deltaCacheSize 1

and try repacking again with 'git gc --prune=now'.  After the repack 
succeeds, you should be able to remove the above configs from your 
.git/config file.

Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html