On Sun, Aug 25, 2013 at 11:50:01AM -0400, Pete Wyckoff wrote:
> Modern git, including your version, do "streaming" reads from p4,
> so the git-p4 python process never even holds a whole file's
> worth of data. You're seeing git-fast-import die, it seems. It
> will hold onto the entire file contents. But just one, not the
> entire repo. How big is the single largest file?
>
> You can import in pieces. See the change numbers like this:
>
>     p4 changes -m 1000 //depot/big/...
>     p4 changes -m 1000 //depot/big/...@<some-old-change>
>
> Import something far enough back in history so that it seems
> to work:
>
>     git p4 clone --destination=big //depot/big@60602
>     cd big
>
> Sync up a bit at a time:
>
>     git p4 sync @60700
>     git p4 sync @60800
>     ...
>
> I don't expect this to get around the problem you describe,
> however. Sounds like there is one gigantic file that is causing
> git-fast-import to fill all of memory. You will at least isolate
> the change.
>
> There are options to git-fast-import to limit max pack size
> and to cause it to skip importing files that are too big, if
> that would help.
>
> You can also use a client spec to hide the offending files
> from git.
>
> Can you watch with "top"? Hit "M" to sort by memory usage, and
> see how big the processes get before falling over.
>
> 		-- Pete

You are correct that git-fast-import is killed by the OOM killer, but I
was unclear about which process was malloc()ing so much memory that the
OOM killer got invoked (as other completely unrelated processes usually
also get killed when this happens).

Unless there's one gigantic file in one change that gets removed by
another change, I don't think that's the problem; as I mentioned in
another email, the machine has 32GB physical memory and the largest
single file in the current head is only 118MB. Even if there is a very
large transient file somewhere in the history, I seriously doubt it's
tens of gigabytes in size.

I have tried watching it with top before, but it takes several hours
before it dies. I haven't been able to see any explosion of memory
usage, even within the final hour, but I've never caught it just before
it dies, either. I suspect that whatever the issue is here, it happens
very quickly.

If I'm unable to get through this today using the incremental p4 sync
method you described, I'll try running a full-blown clone overnight
with top in batch mode writing to a log file to see whether it catches
anything.

Thanks again,

Corey
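
P.S. In case it's useful, here is roughly what I mean by logging top in
batch mode overnight. The sampling interval, iteration count, log path,
and the process names I grep for afterwards are only guesses at this
point:

    # Sample once a minute for about 12 hours, appending to a log.
    top -b -d 60 -n 720 > /tmp/git-p4-top.log 2>&1 &

    # Kick off the full clone as before.
    git p4 clone --destination=big //depot/big@all

    # Afterwards, keep each frame's timestamp line ("top - HH:MM:SS ...")
    # plus the interesting processes, and watch how the RES column grows.
    grep -E '^top -|git-p4|git-fast-import|python' /tmp/git-p4-top.log | less

If the top on that machine supports it, adding -o %MEM should sort each
frame by memory usage, the same as hitting "M" interactively.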