Re: [RFC PATCH 0/5] strvec: add a "nodup" mode, fix memory leaks

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 19 Dec 2022 10:20:00 +0100

On Sat, Dec 17 2022, Jeff King wrote:

> On Thu, Dec 15, 2022 at 10:11:06AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> This is an alternative to René's [1], his already fixes a leak in "git
>> am", and this could be done later, so I'm submitting it as RFC, but it
>> could also replace it.
>> 
>> I think as this series shows extending the "strvec" API to get a
>> feature that works like the existing "strdup_strings" that the "struct
>> string_list" has can make memory management much simpler.
>
> I know this is kind of a surface level review, but...please don't do
> this. We have chased so many bugs over the years due to string-list's
> "maybe this is allocated and maybe not", in both directions (accidental
> leaks and double-frees).
>
> One of the reasons I advocated for strvec in the first place is so that
> it would have consistent memory management semantics, at the minor cost
> of sometimes duplicating them when we don't need to.
>
> And having a nodup form doesn't even save you from having to call
> strvec_clear(); you still need to do so to avoid leaking the array
> itself. It only helps in the weird parse-options case, where we don't
> handle ownership of the array very well (the strvec owns it, but
> parse-options wants to modify it).

Yes, just like "struct string_list" in the "nodup" mode.

I hear you, but I also think you're implicitly conflating two things
here.

There's the question of whether we should in general optimize for safety
over more optimal memory use. I.e. if we simply have every strvec,
string_list etc. own its memory fully we don't need to think as much
about allocation or ownership.

I think we should do that in general, but we also have cases where we'd
like to not do that, e.g. where we're adding thousands of strings to a
string_list, which are all borrowed from elsewhere, except for a few
we'd like to xstrdup().
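
To make the dup vs. nodup distinction concrete, here is a minimal
standalone sketch. It is not git's code: "struct toy_list",
toy_list_append() and toy_list_clear() are made-up names that only mimic
the idea behind the "strdup_strings" flag, under the assumption that a
single per-list flag decides who owns the strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy type for illustration; NOT git's "struct string_list". */
struct toy_list {
	char **items;
	size_t nr, alloc;
	int dup_strings; /* 1: the list owns copies, 0: it borrows pointers */
};

static void toy_list_append(struct toy_list *list, const char *s)
{
	assert(list && s);

	/* Grow the pointer array; the list owns this array in both modes. */
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 16;
		char **p = realloc(list->items, new_alloc * sizeof(*p));
		if (!p)
			abort();
		list->items = p;
		list->alloc = new_alloc;
	}

	if (list->dup_strings) {
		/* "dup" mode: store a private copy that clear() will free. */
		size_t len = strlen(s) + 1;
		char *copy = malloc(len);
		if (!copy)
			abort();
		memcpy(copy, s, len);
		list->items[list->nr++] = copy;
	} else {
		/* "nodup" mode: store the caller's pointer, never free it. */
		list->items[list->nr++] = (char *)s;
	}
}

static void toy_list_clear(struct toy_list *list)
{
	size_t i;

	/* Only free the strings if this list made the copies. */
	if (list->dup_strings)
		for (i = 0; i < list->nr; i++)
			free(list->items[i]);

	/* The array itself must be freed in both modes. */
	free(list->items);
	list->items = NULL;
	list->nr = list->alloc = 0;
}
```

Note that toy_list_clear() is still required in the borrowing mode,
since the array is always owned by the list. The sketch also shows where
the trickiness comes from: a borrowing list that holds a few xstrdup()'d
entries needs those few freed by someone else.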

Such API use *is* tricky, but I think formalizing it as the
"string_list" does is making it better, not worse. In particular...:

>> This does make the API slightly more dangerous to use, as it's no
>> longer guaranteed that it owns all the members it points to. But as
>> the "struct string_list" has shown this isn't an issue in practice,
>> and e.g. SANITIZE=address et al are good about finding double-frees,
>> or frees of fixed strings.
>
> I would disagree that this hasn't been an issue in practice. A few
> recent examples:
>
>   - 5eeb9aa208 (refs: fix memory leak when parsing hideRefs config,
>     2022-11-17)
>   - 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
>     strings, 2022-09-08)
>   - 4c81ee9669 (submodule--helper: fix "reference" leak, 2022-09-01)

...it's funny that those are the examples I would have dug up to argue
that this is a good idea, and to add some:

	- 4a0479086a9 (commit-graph: fix memory leak in misused
          string_list API, 2022-03-04)
	- b202e51b154 (grep: fix a "path_list" memory leak, 2021-10-22)

I.e. above you note "in both directions [...] leaks [...] and double
frees", but these (and the ones I added) are all in the first category.

Which is why I don't think it's an issue in practice. The leaks have
been a non-issue, and to the extent that we care, the SANITIZE=leak
testing is closing that gap.

The dangerous issue is the double-free, but (and over the years I have
looked at pretty much every caller) I can't imagine a string_list
use-case where we:

 a) Actually still want to keep that memory optimization, i.e. we
    wouldn't be better off just saying "screw it, let's dup it".
 b) Given "a", we'd be better off with some bespoke custom pattern over
    the facility to do this with the "string_list".

So I really think we're in agreement about 99% of this, I just don't see
why, *if* we want to do this, we're better off re-inventing this
particular wheel for every caller.