Re: [PATCH] commit: avoid parent list buildup in clear_commit_marks_many()

René Scharfe <l.s.r@xxxxxx> · Thu, 13 Feb 2025 22:38:51 +0100

Am 12.02.25 um 08:05 schrieb Patrick Steinhardt:
> On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote:
>> clear_commit_marks_1() clears the marks of the first parent and its
>> first parent and so on, and saves the higher numbered parents in a list
>> for later.  There is no benefit in keeping that list growing with each
>> handled commit.  Clear it after each run to reduce peak memory usage.
>
> Okay. So the issue here is that `clear_commit_marks_1()` only processes
> the first-parent chain of commits, and any other commits will be added
> to the `struct commit_list` backlog. Consequently, merge-heavy histories
> are very likely to build up quite a backlog of non-first-parent commits.
>
>> Signed-off-by: René Scharfe <l.s.r@xxxxxx>
>> ---
>>  commit.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/commit.c b/commit.c
>> index 540660359d..6efdb03997 100644
>> --- a/commit.c
>> +++ b/commit.c
>> @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist,
>>
>>  void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark)
>>  {
>> -	struct commit_list *list = NULL;
>> -
>>  	for (size_t i = 0; i < nr; i++) {
>> +		struct commit_list *list = NULL;
>> +
>>  		clear_commit_marks_1(&list, *commit, mark);
>> +		while (list)
>> +			clear_commit_marks_1(&list, pop_commit(&list), mark);
>>  		commit++;
>>  	}
>> -	while (list)
>> -		clear_commit_marks_1(&list, pop_commit(&list), mark);
>>  }
>
> And this is fixed by immediately processing all commits that we
> currently have in the list. As `clear_commit_marks_1()` only processes
> immediate children of the handed-in commit we know that it will have
> processed the first parent, and `list` will contain remaining commits,
> if any.

clear_commit_marks_1() processes the whole ancestral chain of first
parents down to the root or first clean ancestor.

> We also end up adding grandchildren to the list, so this change
> essentially converts the algorithm from depth-first to breadth-first.

It's still depth-first, but all second and higher numbered parents added
to the list are cleaned before starting to clean the next commit from
the input list.  So we go from "clean straight down for each input
commit first and clean their side branches later" to "clean all
ancestors for each input commit".

> I bet we can construct cases where this will perform _worse_ than the
> current algorithm, e.g. when you have branch thickets where every commit
> is a merge: But I assume that for the most common cases this might be an
> improvement indeed.

I won't bet, but I'd like to see such a case.  Can't imagine one.

> The question to me is: does this actually matter in the real world? It
> would be nice to maybe get some numbers that demonstrate the improvement
> in a repository.

Well, the maximum list length for clear_commit_marks_many() calls with
nr > 1 in the test suite goes from 12 in t6600 to 4 with the patch.  Not
that exciting.  The question to me is: Why pile up parents in the list
when we can clean them earlier with no downside?  Or is there any?

René