On 3/5/21 1:10 PM, Yonghong Song wrote:
On 3/5/21 12:38 PM, Roman Gushchin wrote:
On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
hi,
I'm getting attached BUG/crash when running in parralel selftests, like:
while :; do ./test_progs -t spinlock; done
while :; do ./test_progs ; done
it's the latest bpf-next/master, I can send the .config if needed,
but I don't think there's anything special about it, because I saw
the bug on other servers with different generic configs
it looks like it's related to cgroup local storage, for some reason
the storage deref returns NULL
I'm bit lost in this code, so any help would be great ;-)
Hi!
I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
unkillable BPF_PROG_TEST_RUN").
Thanks, Roman, I did some experiments and found the reason of NULL
storage deref is because a tracing program (mostly like a kprobe) is run
after bpf_cgroup_storage_set() is called but before bpf program calls
bpf_get_local_storage(). Note that trace_call_bpf() macro
BPF_PROG_RUN_ARRAY_CHECK does call bpf_cgroup_storage_set().
Prior to it, we were running the test program in the preempt_disable()
&& rcu_read_lock()
section:
preempt_disable();
rcu_read_lock();
bpf_cgroup_storage_set(storage);
ret = BPF_PROG_RUN(prog, ctx);
rcu_read_unlock();
preempt_enable();
So, a percpu variable with a cgroup local storage pointer couldn't go
away.
I think even with using preempt_disable(), if the bpf program calls map
lookup and there is a kprobe bpf on function htab_map_lookup_elem(), we
will have the issue as BPF_PROG_RUN_ARRAY_CHECK will call
bpf_cgroup_storage_set() too. I need to write a test case to confirm
this though.
After df1a2cb7c74b we can temporarily enable the preemption, so
nothing prevents
another program to call into bpf_cgroup_storage_set() on the same cpu.
I guess it's exactly what happens here.
It is. I confirmed.
Actually, the failure is not due to breaking up preempt_disable(). Even
with adding cond_resched(), bpf_cgroup_storage_set() still happens
inside the preempt region. So it is okay. What I confirmed is that
changing migration_{disable/enable} to preempt_{disable/enable} fixed
the issue.
So migration_{disable/enable} is the issue since any other process (and
its bpf program) and preempted the current process/bpf program and run.
Currently for each program, we will set the local storage before the
program run and each program may access to multiple local storage
maps. So we might need something similar sk_local_storage.
Considering possibility of deep nested migration_{disable/enable},
the cgroup local storage has to be preserved in prog/map data
structure and not as a percpu variable as it will be very hard
to save and restore percpu virable content as migration can
happen anywhere. I don't have concrete design yet. Just throw some
idea here.
BTW, I send a patch to prevent tracing programs to mess up
with cgroup local storage:
https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@xxxxxx/T/#u
we now all programs access cgroup local storage should be in
process context and we don't need to worry about kprobe-induced
percpu local storage access.
One option to fix it is to make bpf_cgroup_storage_set() to return the
old value,
save it on a local variable and restore after the execution of the
program.
In this particular case, we are doing bpf_test_run, we explicitly
allocate storage and call bpf_cgroup_storage_set() right before
each BPF_PROG_RUN.
But I didn't follow closely the development of sleepable bpf programs,
so I could
easily miss something.
Yes, sleepable bpf program is another complication. I think we need a
variable similar to bpf_prog_active, which should not nested bpf program
execution for those bpf programs having local_storage map.
Will try to craft some patch to facilitate the discussion.
[...]