Re: Selftest failures related to kern_sync_rcu()

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Tue, 13 Apr 2021 10:50:29 +0200

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Thu, Apr 8, 2021 at 12:34 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Hi Andrii
>>
>> I'm getting some selftest failures that all seem to have something to do
>> with kern_sync_rcu() not being enough to trigger the kernel events that
>> the selftest expects:
>>
>> $ ./test_progs | grep FAIL
>> test_lookup_update:FAIL:map1_leak inner_map1 leaked!
>> #15/1 lookup_update:FAIL
>> #15 btf_map_in_map:FAIL
>> test_exit_creds:FAIL:null_ptr_count unexpected null_ptr_count: actual 0 == expected 0
>> #123/2 exit_creds:FAIL
>> #123 task_local_storage:FAIL
>> test_exit_creds:FAIL:null_ptr_count unexpected null_ptr_count: actual 0 == expected 0
>> #123/2 exit_creds:FAIL
>> #123 task_local_storage:FAIL
>>
>> They are all fixed by adding a sleep(1) after the call(s) to
>> kern_sync_rcu(), so I'm guessing it's some kind of
>> timing/synchronisation problem. Is there a particular kernel config
>> that's needed for the membarrier syscall trick to work? I've tried with
>> various settings of PREEMPT and that doesn't really seem to make any
>> difference...
>>
>
> If you check kern_sync_rcu(), it relies on membarrier() syscall
> (passing cmd = MEMBARRIER_CMD_SHARED == MEMBARRIER_CMD_GLOBAL).
> Now, looking at kernel sources:
>   - CONFIG_MEMBARRIER should be enabled for that syscall;
>   - it has some extra conditions:
>
>            case MEMBARRIER_CMD_GLOBAL:
>                 /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>                 if (tick_nohz_full_enabled())
>                         return -EINVAL;
>                 if (num_online_cpus() > 1)
>                         synchronize_rcu();
>                 return 0;
>
> Could it be that one of those conditions is not satisfied?

Aha, bingo! I had found the membarrier syscall stuff, but for some
reason didn't think to actually read its code. I was running this in a
VM with a single CPU; adding another one fixed it. Thanks! :)

Do you think we could detect this in the tests? I suppose the
tick_nohz_full_enabled() check should already result in a visible
failure, since that makes the syscall fail; but the single-CPU case is
silent, so a hint would be nice. Could kern_sync_rcu() check the CPU
count and print a warning or fail if it is 1? Or maybe just straight up
fall back to sleep()'ing?
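
To make the fallback idea concrete, here is a rough, untested sketch of
what I have in mind. The names needs_sleep_fallback() and
kern_sync_rcu_checked() are made up for illustration and are not in the
selftests, and the one-second sleep is only the workaround from my first
mail, not a guaranteed RCU grace period:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical helper: with one online CPU (or an unknown count), the
 * kernel code quoted above skips synchronize_rcu() in the
 * MEMBARRIER_CMD_GLOBAL path, so the syscall cannot be relied on. */
static int needs_sleep_fallback(long online_cpus)
{
	return online_cpus <= 1;
}

/* Hypothetical variant of kern_sync_rcu(); returns 0 on success and -1
 * if the membarrier syscall fails (e.g. -EINVAL under nohz_full). */
static int kern_sync_rcu_checked(void)
{
	/* sysconf() returns -1 on error; that also takes the fallback. */
	long online_cpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (needs_sleep_fallback(online_cpus)) {
		fprintf(stderr,
			"kern_sync_rcu: %ld online CPU(s), membarrier won't wait for RCU; sleeping instead\n",
			online_cpus);
		/* Assumption: 1s is "long enough", as in the sleep(1) workaround. */
		sleep(1);
		return 0;
	}

	/* Same call the existing helper relies on. */
	return syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0, 0);
}
```

The warning goes to stderr so it shows up even when the output is piped
through grep FAIL, but whether to warn, fail, or sleep is the open
question.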

-Toke