On 9/21/24 23:08, Guenter Roeck wrote:
> On 9/21/24 13:40, Vlastimil Babka wrote:
>> +CC kunit folks
>>
>> On 9/20/24 15:35, Guenter Roeck wrote:
>>> Hi,
>>
>> Hi,
>>
>>> On Wed, Aug 07, 2024 at 12:31:20PM +0200, Vlastimil Babka wrote:
>>>> Add a test that will create cache, allocate one object, kfree_rcu() it
>>>> and attempt to destroy it. As long as the usage of kvfree_rcu_barrier()
>>>> in kmem_cache_destroy() works correctly, there should be no warnings in
>>>> dmesg and the test should pass.
>>>>
>>>> Additionally add a test_leak_destroy() test that leaks an object on
>>>> purpose and verifies that kmem_cache_destroy() catches it.
>>>>
>>>> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
>>>
>>> This test case, when run, triggers a warning traceback.
>>>
>>> kmem_cache_destroy TestSlub_kfree_rcu: Slab cache still has objects when called from test_leak_destroy+0x70/0x11c
>>> WARNING: CPU: 0 PID: 715 at mm/slab_common.c:511 kmem_cache_destroy+0x1dc/0x1e4
>>
>> Yes, that should be suppressed like the other slub_kunit tests do. I have
>> assumed it's not that urgent because, for example, the KASAN kunit tests all
>> produce tons of warnings, and thus assumed it's in some way acceptable for
>> kunit tests to do.
>>
>
> I have all tests which generate warning backtraces disabled. Trying to identify
> which warnings are noise and which warnings are on purpose doesn't scale,
> so it is all or nothing for me. I tried earlier to introduce a patch series
> which would enable selective backtrace suppression, but that died the death
> of architecture maintainers not caring and people demanding it to be perfect
> (meaning it only addressed WARNING: backtraces and not BUG: backtraces,
> and apparently that wasn't good enough).

Ah, didn't know, too bad.

> If the backtrace is intentional (and I think you are saying that it is),
> I'll simply disable the test. That may be a bit counter-productive, but
> there is really no alternative for me.
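For context, the first test from the patch boils down to roughly the following sketch (struct and field names are approximate, reconstructed from the commit message and the cache name in the warning above, not copied from the actual patch):

```c
struct kunit_rcu_obj {
	int field;
	struct rcu_head rcu;
};

static void test_kfree_rcu(struct kunit *test)
{
	struct kmem_cache *s = kmem_cache_create("TestSlub_kfree_rcu",
					sizeof(struct kunit_rcu_obj),
					0, 0, NULL);
	struct kunit_rcu_obj *p = kmem_cache_alloc(s, GFP_KERNEL);

	/* Queue the object for freeing after an RCU grace period. */
	kfree_rcu(p, rcu);

	/*
	 * The object may still be pending the grace period here, so
	 * kmem_cache_destroy() must flush it via kvfree_rcu_barrier()
	 * or it would warn that the cache still has objects.
	 */
	kmem_cache_destroy(s);
}
```

The test passes exactly when no "Slab cache still has objects" warning is emitted, i.e. when the kvfree_rcu_barrier() in kmem_cache_destroy() does its job.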
It's intentional in the sense that the test deliberately triggers a condition
that normally produces a warning. Many of the slub kunit tests do that, but
they are able to suppress printing the warning when it happens in the kunit
context. I forgot to do that for the new test initially, as the warning there
comes from a different path than those that already have the kunit
suppression, but we'll implement that suppression there too ASAP.

>>> That is, however, not the worst of it. It also causes boot stalls on
>>> several platforms and architectures (various arm platforms, arm64,
>>> loongarch, various ppc, and various x86_64). Reverting it fixes the
>>> problem. Bisect results are attached for reference.
>>
>> OK, this part is unexpected. I assume you have the test built-in and not a
>> module, otherwise it can't affect boot? And by stall you mean a delay or a
>
> Yes.
>
>> complete lockup? I've tried to reproduce that with virtme, but it seemed
>> fine, maybe it's .config specific?
>
> It is a complete lockup.
>
>>
>> I do wonder about the placement of the call of kunit_run_all_tests() from
>> kernel_init_freeable(), as that's before a bunch of initialization finishes.
>>
>> For example, system_state = SYSTEM_RUNNING; and rcu_end_inkernel_boot() only
>> happen later in kernel_init(). I wouldn't be surprised if that means
>> calling kfree_rcu() or rcu_barrier() or kvfree_rcu_barrier() as part of the
>> slub tests is too early.
>>
>> Does the diff below fix the problem? Any advice from kunit folks? I could
>> perhaps make the slab test module-only instead of tristate, or do
>> some ifdef builtin on the problematic tests, but maybe it wouldn't be
>> necessary with kunit_run_all_tests() happening a bit later.
>>
>
> It does, at least based on my limited testing. However, given that the
> backtrace is apparently intentional, it doesn't really matter - I'll disable
> the test instead.

Right, thanks.
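For reference, the suppression the existing slub kunit tests rely on is (roughly, from memory) a helper in mm/slub.c along these lines: when the code runs under a kunit test, error reports bump a counter stored in a named kunit resource instead of printing a backtrace, and the test then asserts on that counter. The warning above fires from mm/slab_common.c, which has no such hook yet, which is why this test wasn't covered:

```c
/*
 * Sketch of the mm/slub.c suppression pattern; details may differ
 * from the current tree.
 */
static bool slab_add_kunit_errors(void)
{
	struct kunit_resource *resource;

	/* Not running under kunit: report the error normally. */
	if (!kunit_get_current_test())
		return false;

	resource = kunit_find_named_resource(current->kunit_test,
					     "slab_errors");
	if (!resource)
		return false;

	/* Count the error for the test instead of printing it. */
	(*(int *)resource->data)++;
	kunit_put_resource(resource);
	return true;
}
```

Warning sites then do `if (slab_add_kunit_errors()) goto out;` (or similar) before printing, so adding an equivalent check to the kmem_cache_destroy() path should be enough to silence this test.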
I will notify you when the suppression is done so the test can be re-enabled.
But since the lockup should not be caused by the warning itself, we will
still have to address it: others who build the kunit tests in might hit the
lockup, and you would see it again once the warning is suppressed and the
test is re-enabled. Your testing suggests moving the kunit_run_all_tests()
call might be the right fix, so let's see what the kunit folks think. I would
hope that no other kunit test breaks from being executed later in boot,
because when the tests are compiled as modules they already run much later
anyway.

Thanks!
Vlastimil

> Thanks,
> Guenter
>