Unpredictable peak memory usage when using `git log` command

Yuri Karnilaev <karnilaev@xxxxxxxxx> · Fri, 30 Aug 2024 15:20:15 +0300

Hello,

I encountered an issue when using the `git log` command to retrieve commits in large repositories. My task is to iterate over all commits and output them in a specific format. However, my computer has limited memory, so I am looking for a way to reduce the memory consumption of this operation.

I tested two different commands on the `torvalds/linux` repository as an example of a large repository and noticed a significant difference in peak memory usage:

1. Processing all commits in one go:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 --numstat > 1.txt
```
Result:
```
real 594,01
user 562,22
sys 12,43
          7407976448  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              187437  page reclaims
              274228  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                1031  voluntary context switches
              287056  involuntary context switches
       5455479398547  instructions retired
       1828253079874  cycles elapsed
           135616064  peak memory footprint
```

2. Processing commits in batches:
```
/usr/bin/time -l -h -p git log --ignore-missing --pretty=format:%H%x02%P%x02%aN%x02%aE%x02%at%x00 -n 1000 --skip=1000000 --numstat > 1.txt
```
Result:
```
real 9,83
user 7,48
sys 0,40
          2390540288  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               93487  page reclaims
               52995  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 634  voluntary context switches
               14183  involuntary context switches
         50173495540  instructions retired
         24906960156  cycles elapsed
          1470935680  peak memory footprint
```
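
The command above is a single batch. A minimal sketch of how a full run in batches could be driven is below. The function name `batch_commands` and its two arguments (total commit count and batch size) are hypothetical and only for illustration. It prints the commands rather than running them, so the offsets can be checked first.

```shell
# Hypothetical helper (not part of Git): print one `git log` command per batch.
# $1 = total number of commits (assumed to be known in advance), $2 = batch size.
batch_commands() {
  total=$1
  size=$2
  skip=0
  while [ "$skip" -lt "$total" ]; do
    # %% in the printf format emits a literal % for the --pretty placeholders.
    printf 'git log --ignore-missing --pretty=format:%%H%%x02%%P%%x02%%aN%%x02%%aE%%x02%%at%%x00 -n %s --skip=%s --numstat\n' "$size" "$skip"
    skip=$((skip + size))
  done
}
```

For example, `batch_commands 2500 1000` prints three commands with `--skip=0`, `--skip=1000` and `--skip=2000`. The total could be obtained with `git rev-list --count HEAD`.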

As the results show, the peak memory footprint when processing commits in batches (1470935680 bytes) is roughly 10 times higher than when processing all commits in one go (135616064 bytes).
Can you please explain why this happens? Is there a way to work around it? Or could this be fixed in a future Git version?
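
The ratio can be checked from the `/usr/bin/time -l` figures reported above. This is only arithmetic on those numbers, and the variable names are illustrative. `LC_ALL=C` is set so that `awk` prints a decimal point regardless of locale.

```shell
# Values copied from the two `/usr/bin/time -l` outputs above (bytes).
footprint_single=135616064    # peak memory footprint, all commits in one go
footprint_batch=1470935680    # peak memory footprint, -n 1000 --skip=1000000
rss_single=7407976448         # maximum resident set size, all commits in one go
rss_batch=2390540288          # maximum resident set size, -n 1000 --skip=1000000

# Batch vs. single-pass peak memory footprint.
footprint_ratio=$(LC_ALL=C awk -v a="$footprint_batch" -v b="$footprint_single" 'BEGIN { printf "%.1f", a / b }')
# Single-pass vs. batch maximum resident set size.
rss_ratio=$(LC_ALL=C awk -v a="$rss_single" -v b="$rss_batch" 'BEGIN { printf "%.1f", a / b }')

echo "peak memory footprint, batch / single: $footprint_ratio"
echo "maximum resident set size, single / batch: $rss_ratio"
```

The two metrics move in opposite directions in these runs: the peak memory footprint is higher for the batch, while the maximum resident set size is higher for the single pass.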

Operating System: macOS 14.6.1 (23G93)
Git Version: 2.39.3 (Apple Git-146)

Best regards,
Yuri