Re: Unpredictable peak memory usage when using `git log` command

Jeff King <peff@xxxxxxxx> · Fri, 30 Aug 2024 17:06:07 -0400

On Fri, Aug 30, 2024 at 03:20:15PM +0300, Yuri Karnilaev wrote:

> 2. Processing commits in batches:
> ```
> /usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
> ```
> [...]
> Operating System: Mac OS 14.6.1 (23G93)
> Git Version: 2.39.3 (Apple Git-146)

I sent a patch which I think should make things better for you, but I
wanted to mention two things in a more general way:

  1. You should really consider building a commit-graph file with "git
     commit-graph write --reachable". That will reduce the memory usage
     for this case, and also cut the CPU time quite a bit (we won't have
     to open those million skipped commits to chase their parent
     pointers).

     I haven't kept up with the defaults for writing graph files. I
     thought gc.writeCommitGraph defaults to "true" these days, though
     that wouldn't help in a freshly cloned repository (arguably we
     should write the commit graph on clone?).

  2. Using "--skip" still has to traverse all of those intermediate
     commits. So it's effectively quadratic in the number of commits
     overall (you end up skipping the first 1000 over and over).
     Walking 1,000,000 commits in chunks of 1000 that way visits
     roughly 500 million commits in total.

     It's been a while since I've had to "paginate" segments of history
     like this, but a better solution is along the lines of:

       - use "-n 1000" to get 1000 commits in each chunk

       - use "--boundary" to report the commits that were queued to be
         traversed next but weren't shown

       - in invocations after the first one, start the traversal at
         those boundary commits, rather than HEAD

     You'll probably need to add "%m" to your format to show the
     boundaries (or alternatively, you can do the commit selection with
     rev-list, and then pipe the result to "log --no-walk --stdin" to
     do the pretty-printing).
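
The steps in point 2 could be sketched roughly as below, using the
rev-list variant. Treat it as an illustration under assumptions, not a
definitive recipe: paginate_log, the chunk-N.txt output names and the
temporary files are made up for the example, and the format string is
the one from the original report. Since the default ordering is by
commit date, a history with skewed or tied timestamps might show a
commit in more than one chunk; the sketch does not de-duplicate.

```shell
#!/bin/sh
# paginate_log CHUNK_SIZE
# Hypothetical helper (not a git command). Run it inside a repository;
# it writes chunk-1.txt, chunk-2.txt, ... into the current directory.
paginate_log() {
    chunk=$1
    tips=$(mktemp)   # starting points for the next traversal
    page=$(mktemp)   # raw rev-list output for the current chunk
    echo HEAD >"$tips"
    n=0
    while [ -s "$tips" ]; do
        n=$((n + 1))
        # Select at most $chunk commits. With --boundary, rev-list also
        # prints the commits it stopped at, each prefixed with "-".
        git rev-list -n "$chunk" --boundary --stdin <"$tips" >"$page"
        # Pretty-print only the selected commits, without walking further.
        grep -v '^-' "$page" |
            git log --no-walk --stdin --numstat \
                "--pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00" \
                >"chunk-$n.txt"
        # The boundary commits become the tips of the next chunk; when
        # there are none, the file is empty and the loop ends.
        sed -n 's/^-//p' "$page" >"$tips"
    done
    rm -f "$tips" "$page"
}
```

Calling "paginate_log 1000" from the top of the repository would then
produce one file per chunk of up to 1000 commits, with each invocation
starting where the previous one stopped instead of re-walking from
HEAD.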

-Peff