RE: [PATCH] git-svn: make batch mode optional for git-cat-file

Victor Leschuk <vleschuk@xxxxxxxxxxxxxxxx> · Wed, 23 Sep 2015 08:28:02 -0700

Hello Eric, thanks for looking into it.

>> git-cat-file has outgrown the parent perl process several times
>> (git-cat-file - ~3-4Gb, perl - 400Mb).

> Ugh, that sucks.
> Even the 400Mb size of Perl annoys me greatly and I'd work
> on fixing it if I had more time.

I was also going to look at this problem, but first I'd like to improve the situation with cat-file, since on large repos it is the bigger problem. By the way, which direction would you suggest starting with?

> A few more questions:

> * What is the largest file that existed in that repo?

About 2.5M

> * Did you try "MALLOC_MMAP_THRESHOLD_" with glibc?

I have just tried it on a smaller repo (which takes about 1 hour to clone; RSS grows from 4M to 40M during the process). Unfortunately there is not much of an effect: max RSS is 41M with default settings and 38M with MALLOC_MMAP_THRESHOLD_=131072.

> If alloc.c is the culprit, I would consider to transparently restart
> "cat-file --batch" once it grows to a certain size or after a certain
> number of requests are made to it.

The alloc.c interface is not used in cat-file at all; it only makes direct calls to xmalloc and xrealloc from wrapper.c, and to xmmap() from sha1_file.c.

> > git-cat-file has outgrown the parent perl process several times
> > (git-cat-file - ~3-4Gb, perl - 400Mb).

> How much of that is anonymous memory, though?

I haven't measured on this particular repo: I didn't redo the 2-week experiment =) However, I checked on a smaller test repo, and anonymous memory is about 12M out of 40M total. Most of the memory is really taken by mmap-ed *.pack and *.idx files.
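
For reference, the anonymous vs. file-backed split can be approximated from /proc/PID/smaps on Linux. The following is a minimal sketch, not the method used for the numbers above: the function names are hypothetical, and it assumes a kernel whose smaps output includes the per-mapping "Rss:" and "Anonymous:" fields. It treats everything resident that is not anonymous as file-backed (for git-cat-file, mostly mmap-ed pack and index files).

```python
def summarize_smaps(text):
    """Sum resident and anonymous memory (in kB) over all mappings in smaps text."""
    rss_kb = 0
    anon_kb = 0
    for line in text.splitlines():
        parts = line.split()
        # Field lines look like "Rss:        40 kB"; mapping header lines do not.
        if len(parts) == 3 and parts[2] == "kB":
            if parts[0] == "Rss:":
                rss_kb += int(parts[1])
            elif parts[0] == "Anonymous:":
                anon_kb += int(parts[1])
    # Approximation: whatever is resident but not anonymous is file-backed.
    return {"rss_kb": rss_kb, "anon_kb": anon_kb, "file_kb": rss_kb - anon_kb}


def summarize_pid(pid):
    """Read /proc/<pid>/smaps for a live process and summarize it (Linux only)."""
    with open("/proc/%d/smaps" % pid) as fh:
        return summarize_smaps(fh.read())
```

Calling summarize_pid() with the PID of the running git-cat-file process gives the same kind of breakdown as reading the pmap output by hand.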

Actually I accidentally found out that if I export GIT_MALLOC_LIMIT variable set to several megabytes it has the following effect:
 * git-svn.perl launches git-gc
 * git-gc can't allocate enough memory and thus doesn't create any pack files
 * git-cat-file works only with pure blob objects, not packs, and its memory usage doesn't grow larger than 4-5M

It gave me the thought that maybe we could get rid of the "git gc" calls after each commit in the Perl code and just perform one large gc operation at the end. It will cost disk space during the clone but save us memory. What do you think?

As for your suggestion regarding periodic restart of batch process inside git-cat-file, I think we could give it a try, I can prepare a patch and run some tests.

--
Best Regards,
Victor
________________________________________
From: Eric Wong [normalperson@xxxxxxxx]
Sent: Tuesday, September 22, 2015 5:35 PM
To: Victor Leschuk
Cc: Junio C Hamano; git@xxxxxxxxxxxxxxx
Subject: Re: [PATCH] git-svn: make batch mode optional for git-cat-file

Eric Wong <normalperson@xxxxxxxx> wrote:
> Victor Leschuk <vleschuk@xxxxxxxxxxxxxxxx> wrote:
> > The thing is that git-cat-file keeps growing during work when running
> > in "batch" mode. See the figure attached: it is for cloning a rather
> > small repo (1 hour to clone about ~14000 revisions). However the clone
> > of a large repo (~280000 revisions) took about 2 weeks and
> > git-cat-file has outgrown the parent perl process several times
> > (git-cat-file - ~3-4Gb, perl - 400Mb).

How much of that is anonymous memory, though?
(pmap $PID_OF_GIT_CAT_FILE)

Running the following on the Linux kernel tree I had lying around:

(for i in $(seq 100 200); do git ls-files | sed -e "s/^/HEAD~$i:/"; done)|\
  git cat-file --batch >/dev/null

Reveals about 510M RSS in top, but pmap says less than 20M of that
is anonymous.  So the rest are mmap-ed packfiles; that RSS gets
transparently released back to the kernel under memory pressure.