Re: [PATCH] am: don't pass strvec to apply_parse_options()

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 14 Dec 2022 09:44:12 +0100

On Tue, Dec 13 2022, René Scharfe wrote:

> Am 13.12.22 um 09:37 schrieb Ævar Arnfjörð Bjarmason:
>>
>> On Tue, Dec 13 2022, René Scharfe wrote:
>>
>>> apply_parse_options() passes the array of argument strings to
>>> parse_options(), which removes recognized options.  The removed strings
>>> are not freed, though.
>>>
>>> Make a copy of the strvec to pass to the function to retain the pointers
>>> of its strings, so we release them all at the end.
>>>
>>> Signed-off-by: René Scharfe <l.s.r@xxxxxx>
>>> ---
>>>  builtin/am.c | 12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/builtin/am.c b/builtin/am.c
>>> index 30c9b3a9cd..dddf1b9af0 100644
>>> --- a/builtin/am.c
>>> +++ b/builtin/am.c
>>> @@ -1476,6 +1476,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	int res, opts_left;
>>>  	int force_apply = 0;
>>>  	int options = 0;
>>> +	const char **apply_argv;
>>>
>>>  	if (init_apply_state(&apply_state, the_repository, NULL))
>>>  		BUG("init_apply_state() failed");
>>> @@ -1483,7 +1484,15 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	strvec_push(&apply_opts, "apply");
>>>  	strvec_pushv(&apply_opts, state->git_apply_opts.v);
>>>
>>> -	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
>>> +	/*
>>> +	 * Build a copy that apply_parse_options() can rearrange.
>>> +	 * apply_opts.v keeps referencing the allocated strings for
>>> +	 * strvec_clear() to release.
>>> +	 */
>>> +	ALLOC_ARRAY(apply_argv, apply_opts.nr);
>>> +	COPY_ARRAY(apply_argv, apply_opts.v, apply_opts.nr);
>>> +
>>> +	opts_left = apply_parse_options(apply_opts.nr, apply_argv,
>>>  					&apply_state, &force_apply, &options,
>>>  					NULL);
>>>
>>> @@ -1513,6 +1522,7 @@ static int run_apply(const struct am_state *state, const char *index_file)
>>>  	strvec_clear(&apply_paths);
>>>  	strvec_clear(&apply_opts);
>>>  	clear_apply_state(&apply_state);
>>> +	free(apply_argv);
>>>
>>>  	if (res)
>>>  		return res;
>>
>> I don't mind this going in, but it really feels like a bit of a dirty
>> hack.
>>
>> We have widespread leaks all over the place due to this
>> idiom. I.e. parse_options() and a couple of other APIs expect that they
>> can munge the "argv", which is fine if it arrives via main(), but not if
>> we're editing our own strvecs.
>
> Where?  A quick "git grep 'parse_options.*nr'" turns up only this place
> as one that passes a strvec to parse_options.

Sorry about the vagueness, this was from memory and I didn't have a full
list handy.

First, that grep isn't going to give you what you want, since it only
catches cases where we're passing the "strvec" to the "parse_options()"
directly.

But if you look at e.g. how cmd_stash() invokes push_stash() or
cmd_describe() invoking cmd_name_rev() you'll see callers where we're
potentially losing the strvec due to parse_options().

"Potentially" because in both those cases we call that API, but in the
latter case we've got a missing strvec_clear(), so maybe that's all
that's needed (it doesn't always munge the argv).

>> I think less of a hack is to teach the eventual parse_options() that
>> when it munges it it should free() it. I did that for the revisions API
>> in f92dbdbc6a8 (revisions API: don't leak memory on argv elements that
>> need free()-ing, 2022-08-02).
>>
>> What do you think?
>
> Generating string lists and then parsing them is weird.  When calls have
> to cross a process boundary then we have no choice, but in-process we
> shouldn't have to lower our request to an intermediate text format.  git
> am does it anyway because it writes its options to a file and reads them
> back when it resumes with --continue, IIUC.

I agree, but making all those use a nicer API would be a much bigger
change.

> I hope that is and will be the only place that uses parse_options() with
> a strvec -- and then we don't have to change that function.
>
> If this pattern is used more widely then we could package the copying
> done by this patch somehow, e.g. by adding a strvec_parse_options()
> that wraps the real thing.

The reason I'm encouraging you to look at some of the other strvec leaks
is to see if you can think of a pattern that fits these various leaky
callers.

So e.g. a strvec_parse_options() won't work for the ones noted above
where we've lost track of it being a "struct strvec" in the first place.

I have a local version somewhere where I did (pseudocode):

	struct strvec vec = STRVEC_INIT;
	struct string_list to_free = STRING_LIST_INIT_DUP;

        // populate vec...
        for (size_t i = 0; i < vec.nr; i++)
		string_list_append_nodup(&to_free, v[i]):
	// call parse_options(), or otherwise munge "vec"...
	free(strvec_detach(&vec));
        string_list_clear(&to_free, 0);

I.e. you snapshot the members of the "vec" before it's munged, and
free() those via a "struct string_list".

Then you don't strvec_clear(), but instead free() the container, its
members being free'd by the string_list.

Here's a (not yet in-tree) patch that uses that:
https://lore.kernel.org/git/patch-10.10-81f138e460c-20221017T115544Z-avarab@xxxxxxxxx/

I think that's more reliable & general than the pattern you're adding
here.

In your version the only reason we're not segfaulting is because the
parse_options() consumes all the arguments, but that's not something you
can generally rely on with parse_options().

Also, and maybe I'm missing something, but wouldn't this be functionally
the same as your implementation:

	diff --git a/builtin/am.c b/builtin/am.c
	index 30c9b3a9cd7..e65262cbb21 100644
	--- a/builtin/am.c
	+++ b/builtin/am.c
	@@ -1476,11 +1476,13 @@ static int run_apply(const struct am_state *state, const char *index_file)
	 	int res, opts_left;
	 	int force_apply = 0;
	 	int options = 0;
	+	char *to_free;

	 	if (init_apply_state(&apply_state, the_repository, NULL))
	 		BUG("init_apply_state() failed");

	 	strvec_push(&apply_opts, "apply");
	+	to_free = (char *)apply_opts.v[0];
	 	strvec_pushv(&apply_opts, state->git_apply_opts.v);

	 	opts_left = apply_parse_options(apply_opts.nr, apply_opts.v,
	@@ -1489,6 +1491,7 @@ static int run_apply(const struct am_state *state, const char *index_file)

	 	if (opts_left != 0)
	 		die("unknown option passed through to git apply");
	+	free(to_free);

	 	if (index_file) {
	 		apply_state.index_file = index_file;

I.e. we leak the strvec_push() of the "apply" literal, but for the rest
we append the "state->git_apply_opts.v", which we already free()
elsewhere.

So actually, aren't we really just fumbling our way towards the "nodup"
interface that the "struct string_list" has, and we should add the same
to the "struct strvec".

This seems to also fix all the leaks you fixed, but without any of the
copying:

	diff --git a/builtin/am.c b/builtin/am.c
	index 30c9b3a9cd7..691ec1d152d 100644
	--- a/builtin/am.c
	+++ b/builtin/am.c
	@@ -1471,7 +1471,7 @@ static int parse_mail_rebase(struct am_state *state, const char *mail)
	 static int run_apply(const struct am_state *state, const char *index_file)
	 {
	 	struct strvec apply_paths = STRVEC_INIT;
	-	struct strvec apply_opts = STRVEC_INIT;
	+	struct strvec apply_opts = STRVEC_INIT_NODUP;
	 	struct apply_state apply_state;
	 	int res, opts_left;
	 	int force_apply = 0;
	diff --git a/strvec.c b/strvec.c
	index 61a76ce6cb9..efdc47a5854 100644
	--- a/strvec.c
	+++ b/strvec.c
	@@ -22,7 +22,9 @@ static void strvec_push_nodup(struct strvec *array, const char *value)

	 const char *strvec_push(struct strvec *array, const char *value)
	 {
	-	strvec_push_nodup(array, xstrdup(value));
	+	const char *to_push = array->nodup_strings ? value : xstrdup(value);
	+
	+	strvec_push_nodup(array, to_push);
	 	return array->v[array->nr - 1];
	 }

	@@ -31,6 +33,9 @@ const char *strvec_pushf(struct strvec *array, const char *fmt, ...)
	 	va_list ap;
	 	struct strbuf v = STRBUF_INIT;

	+	if (array->nodup_strings)
	+		BUG("cannot strvec_pushf() on a 'nodup' strvec");
	+
	 	va_start(ap, fmt);
	 	strbuf_vaddf(&v, fmt, ap);
	 	va_end(ap);
	@@ -67,6 +72,9 @@ void strvec_pop(struct strvec *array)

	 void strvec_split(struct strvec *array, const char *to_split)
	 {
	+	if (array->nodup_strings)
	+		BUG("cannot strvec_split() on a 'nodup' strvec");
	+
	 	while (isspace(*to_split))
	 		to_split++;
	 	for (;;) {
	@@ -90,7 +98,7 @@ void strvec_clear(struct strvec *array)
	 	if (array->v != empty_strvec) {
	 		int i;
	 		for (i = 0; i < array->nr; i++)
	-			free((char *)array->v[i]);
	+			free(array->nodup_strings ? NULL : (char *)array->v[i]);
	 		free(array->v);
	 	}
	 	strvec_init(array);
	diff --git a/strvec.h b/strvec.h
	index 9f55c8766ba..ac6e2c04cea 100644
	--- a/strvec.h
	+++ b/strvec.h
	@@ -31,12 +31,18 @@ struct strvec {
	 	const char **v;
	 	size_t nr;
	 	size_t alloc;
	+	unsigned int nodup_strings:1;
	 };

	 #define STRVEC_INIT { \
	 	.v = empty_strvec, \
	 }

	+#define STRVEC_INIT_NODUP { \
	+	.v = empty_strvec, \
	+	.nodup_strings = 1, \
	+}
	+
	 /**
	  * Initialize an array. This is no different than assigning from
	  * `STRVEC_INIT`.

In any case, for this change (and leak fixes in general), could you
please also mark the now-passing leak-free tests. This fails after your
patch, but passes before.

The failure is a "good" one, as they're now leak-free, but should be
marked as such:

	GIT_TEST_PASSING_SANITIZE_LEAK=check GIT_TEST_SANITIZE_LEAK_LOG=true \
	make SANITIZE=leak test T="t4256-am-format-flowed.sh t0023-crlf-am.sh t4254-am-corrupt.sh t4257-am-interactive.sh t5403-post-checkout-hook.sh t4152-am-subjects.sh"

> If we have to change parse_options() at all then I'd prefer it to not
> free() anything (to keep it usable with main()'s parameters),...

I'm suggesting passing a flag to it to have it free() things it
NULL's.

Maybe that's a bad idea, but I don't see the incompatibility with
main(), for those we just wouldn't pass the flag.

It does have the inherent limitation that you couldn't mix strvec and
non-strvec parameters, i.e. if some of your elements are dup'd but
others aren't you'd need to arrange to have them all dup'd.

But I don't think that's an issue in practice, either we pass the fully
dup'd version, or main() parameters.

> but to
> reorder in a non-destructive way.  That would mean keeping the NULL
> sentinel where it is, and making sure all callers use only the returned
> argc to determine which arguments parse_options() didn't recognize.

I think this would work if parse_options() wasn't clobbering those
elements in some cases, and replacing them with new ones, but doesn't it
do that e.g. in parse_options_step()?

It also seems like a much more fragile change to try to ensure that
nothing uses the "argv" again, but only looks at the updated
"argc".

Both that & adding a "const" to it would be a huge change across the
codebase, so I'd like to avoid them, but at least the "const" method
would be helped along by the compiler.