Re: [PATCH] leak tests: add an interface to the LSAN_OPTIONS "suppressions"

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 27 Oct 2021 22:57:52 +0200

On Wed, Oct 27 2021, Jeff King wrote:

> On Wed, Oct 27, 2021 at 10:04:17AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> >   - it keeps the annotations near the code. Yes, that creates conflicts
>> >     when the code is changed (or the leak is actually fixed), but that's
>> >     a feature. It keeps them from going stale.
>> 
>> I fully agree with you in general for "good" uses of UNLEAK(). E.g. consider:
>> 
>>     struct strbuf buf = xmalloc(...);
>>     [...]
>>     UNLEAK(buf);
>> 
>> If I was fixing leaks in that sort of code what I pointed out downthread
>> in [1] wouldn't apply.
>
> I'm OK with such a use of UNLEAK(), though of course you could probably
> just free the buffer in that instance.
>
> But...
>
>> So to clarify, I'm not asking in [1] that UNLEAK() not be used at all
>> while we have in-flight leak fixes. I.e. I'd run into a textual
>> conflict, but that would be trivial to resolve, and obvious what the
>> semantic & textual conflict was.
>> 
>> Rather, it's not marking specific leaks, but UNLEAK() on a container
>> struct that's problematic.
>
> ...that's the whole point of UNLEAK(), IMHO. You know that the container
> struct is leaked, but you don't care because the program is about to
> exit, or there's no easy way to avoid the leak.
>
> So it's not the "container" element, but rather it can be a problem if
> people annotate too broadly (you will miss some leaks). In the case of
> rev_info, there is no way to _not_ leak right now, because it has no
> cleanup function.

It doesn't have one, but there are uses of setup_revisions() and
rev_info usage that don't leak, as that builtin/rev-list.c case shows.

I mean, in that case it's not doing much of anything, but at least we
test that setup_revisions() itself doesn't leak right now, but wouldn't
with UNLEAK().

> In the long run we should probably add one.

After the current in-flight patches I've got making diff_free() a bit
more useful, and then a revisions_release() that'll handle most cases,
which gets us started on mass-marking the tests as leak free.

So just FWIW I'm not saying "hey can we hold off on that UNLEAK() for
far future xyz", but for a thing I've got queued up that I'd rather not
start rewriting...

> [...]But in the meantime, annotating them so we can find the
> _interesting_ leaks is worth doing (and that was the whole point of
> adding UNLEAK in the first place.
>
> And I think the UNLEAK() calls in Taylor's patch are fine in that sense.
> While yes, _some_ runs of those commands will not leak, the point is to
> say that when they do leak, they are not interesting. And they are not
> interesting because that rev_info is in scope until the program exits,
> so anything it points to is only leaked at a moment where we no longer
> care.

I don't per-se care about leaks at program exit either, but am making
them leak-free as a proxy for making the relevant APIs useful for
persistent use.

So I think the historical approach of just sprinkling UNLEAK() since
we're exiting anyway is taking too narrow of a view.

In that case if we'd like to see if revisions.[ch] is leak-free we'd
need parallel test-helper tests (or whatever) for all its features, or
we can just piggy-back on every "git log" invocation in the test
suite...

>> So we wouldn't just be marking a known specific leak, but hiding all
>> leaks & non-leaks in container from the top-level, and thus hide
>> potential regressions, an addition to attaining the end-goal of marking
>> some specific test as passing.
>
> I'd argue it _is_ a known specific leak. It is rev_info going out of
> scope that causes the leak, not rev_info holding on to memory in things
> like its pending array.
>
> Now those can be interesting, too (if it no longer needs the memory, can
> it perhaps get rid of it). But IMHO all of that is pretty secondary to
> clearing the noise so you can find "true" leaks (ones where some
> sub-function really is allocating and then losing the pointer entirely,
> especially if it's called an arbitrary number of times).

I wasn't trying to get into the philosophical discussion of whether a
struct or a struct it contains can be said to be the "real" source of a
leak :)

Yeah, I agree that from that POV there's no difference.

I'm merely referring to the practical aspect of patches I've got queued
up, which do things like fix all leaks of rev_info.pending, mark tests
that now pass, fix another rev_info.whatever, mark tests etc.

For rev_info in particular it's a bit iffy to UNLEAK() it, since it also
contains pointers to stuff it doesn't "own", sometimes things are
allocated and handed off to it, sometimes for a while, sometimes
permanently.

So by UNLEAK()ing it you're not just unleaking "its" memory, but
potentially hiding leaks in APIs that need to interact with it in that
way. E.g. mailmap.c is one of those cases.

>> >   - leak-checkers only know where things are allocated, not who is
>> >     supposed to own them. So you're often annotating the wrong thing;
>> >     it's not a strdup() call which is buggy and leaking, but the
>> >     function five layers up the stack which was supposed to take
>> >     ownership and didn't.
>> 
>> As noted in [3] this case is because the LSAN suppressions list was in
>> play, so we excluded the meaningful part of the stack trace (which is
>> shown in that mail).
>
> I don't think that's true at all. The annotations you showed in that
> message, for example, are pointing at add_rev_cmdline(). But that is
> _not_ the source of the leak, nor where it should be fixed. It is a
> necessary part of how rev_info works. The leak is when the rev_info goes
> out of scope without anybody cleaning it up.
>
>> Hrm, now that I think about it I think that the cases where I had to
>> resort to valgrind to get around such crappy stacktraces (when that was
>> all I got, even without suppressions) were probably all because there
>> was an UNLEAK() in play...
>
> I don't see how UNLEAK() would impact stack traces. It should either
> make something not-leaked-at-all (in which case LSan will no longer
> mention it), or it does nothing (it throws some wasted memory into a
> structure which is itself not leaked).

Yes, I think either categorically wrong here, or it applies to some
other case I wasn't able to dig up. Or maybe not, doesn't Taylor's
example take it from "Direct leak" to "Indirect leak" with the
suppression in play? I think those were related somehow (but don't have
that in front of me as I type this out).

E.g. (to reinforce your point) try compiling with SANITIZE=leak and running:

    $ TZ=UTC t/helper/test-tool date show:format:%z 1466000000 +0200
    1466000000 -> +0000
    +0200 -> +0000

    =================================================================
    ==335188==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 3 byte(s) in 1 object(s) allocated from:
        #0 0x7f31cdd21db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
        #1 0x7f31cdb04e4a in __GI___strdup string/strdup.c:42

    SUMMARY: LeakSanitizer: 3 byte(s) leaked in 1 allocation(s).

Now try with "SANITIZE=" and valgrind can give you the "real" source:

    $ TZ=UTC valgrind --leak-check=full t/helper/test-tool date show:format:%z 1466000000 +0200
    ==336515== Memcheck, a memory error detector
    ==336515== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==336515== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
    ==336515== Command: t/helper/test-tool date show:format:%z 1466000000 +0200
    ==336515==
    1466000000 -> +0000
    +0200 -> +0000
    ==336515==
    ==336515== HEAP SUMMARY:
    ==336515==     in use at exit: 1,208 bytes in 13 blocks
    ==336515==   total heap usage: 50 allocs, 37 frees, 27,593 bytes allocated
    ==336515==
    ==336515== 3 bytes in 1 blocks are definitely lost in loss record 1 of 13
    ==336515==    at 0x483877F: malloc (vg_replace_malloc.c:307)
    ==336515==    by 0x49C6E4A: strdup (strdup.c:42)
    ==336515==    by 0x261E3A: xstrdup (wrapper.c:29)
    ==336515==    by 0x152785: parse_date_format (date.c:991)
    ==336515==    by 0x116A34: show_dates (test-date.c:39)
    ==336515==    by 0x116DE9: cmd__date (test-date.c:116)
    ==336515==    by 0x115227: cmd_main (test-tool.c:127)
    ==336515==    by 0x115334: main (common-main.c:52)
    ==336515==
    ==336515== LEAK SUMMARY:
    ==336515==    definitely lost: 3 bytes in 1 blocks
    ==336515==    indirectly lost: 0 bytes in 0 blocks
    ==336515==      possibly lost: 0 bytes in 0 blocks
    ==336515==    still reachable: 1,205 bytes in 12 blocks
    ==336515==         suppressed: 0 bytes in 0 blocks
    ==336515== Reachable blocks (those to which a pointer was found) are not shown.
    ==336515== To see them, rerun with: --leak-check=full --show-leak-kinds=all
    ==336515==
    ==336515== For lists of detected and suppressed errors, rerun with: -s
    ==336515== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

I don't know why it habitually loses track of seemingly easy-to-track
leaks sometimes, but it does. There isn't an UNLEAK() in play here.

When I run into these I try it with valgrind, fix the leak, then test
again with SANITIZE=leak. I haven't found cases where it actually misses
things, just cases where the stack traces suck.