Re: [PATCH v3 2/2] diff: add -I<regex> that ignores matching changes

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 16 Oct 2020 11:04:02 -0700

Phillip Wood <phillip.wood123@xxxxxxxxx> writes:

>> @@ -6405,6 +6428,11 @@ void diff_flush(struct diff_options *options)
>>   	DIFF_QUEUE_CLEAR(q);
>>   	if (options->close_file)
>>   		fclose(options->file);
>> +	for (i = 0; i < options->ignore_regex_nr; i++) {
>> +		regfree(options->ignore_regex[i]);
>> +		free(options->ignore_regex[i]);
>> +	}
>> +	free(options->ignore_regex);
>
> If I run `git log -p -I foo` then the address sanitizer reports
>
> AddressSanitizer: heap-use-after-free xdiff/xdiffi.c:1027 in
> record_matches_regex
>
> after it has printed the diff for the first commit. I think freeing
> the regex here is the cause of the problem.

Yes, this is a good location to clear the diff_queue, which is
per-invocation.  But diff_options->ignore_regex[] can and should be
shared across multiple invocations of the diff machinery, as we are
parsing upfront in the command line option parser just once to be
used throughout the life of the program.  It is wrong to free them
early like this.

I really hate to suggest this, one common API call made into the
diff machinery from various consumers, when they are absolutely done
with diff_options, is probably diff_result_code().  Its purpose is
to collect bits computed to participate in the overall result into
the final result code and anybody who ran one or more diff and wants
to learn the outcome must call it, so it is unlikely that callers
that matter are missing a call to finalize their uses of the diff
machinery.  So if freeing .ignore_regex[] is really necessary, it
could be used to tell us where to do so [*1*].

Unlike per-invocation resources that MUST be freed for repeated
invocations of the diff machinery made in "git log" family,
resources held for and repeatedly used during the entire session
like this may not have to be freed, to be honest, though.

Thanks.

[Footnote]

*1* Adding calls to free() in diff_result_code() directly is
    probably not a good idea.  Some callers may not even need result
    code (e.g.  "blame") and do not call it at all, but we may want
    to free the resource in diff_options that is not per-invocation.
    So if we were to do this, we probably need a new API function,
    perhaps diff_options_clear() or something, and call that from
    diff_result_code(), and then find callers that do not care about
    the result code and call diff_options_clear() directly from
    them.

    If a caller does a single diff_setup_done(), makes repeated
    calls to the diff machinery, and calls diff_result_code() once
    per each iteration (e.g. imagine "git log -p" that detects the
    status for each commit), then having the diff_options_clear()
    call in diff_result_code() will break the machinery, so if we
    were to follow the outline given in the previous paragraph, we
    need to make sure there is no such caller.  But I see that
    diff_result_code() does not clear the fields it looks at in the
    diff_options fields after it uses them, so the second and
    subsequent calls into the diff machinery by such a caller may
    already be broken (not in the "resource leakage" sense, but in
    the correctness sense).