Add Tejun to provide more clarification on cgroup side.
On 8/10/21 9:32 AM, Paul E. McKenney wrote:
On Tue, Aug 10, 2021 at 10:08:58AM +0200, Daniel Borkmann wrote:
[ +Paul ]
On 8/10/21 1:51 AM, Yonghong Song wrote:
[...]
I will hit the following rcu warning:
include/linux/cgroup.h:481 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by test_progs/260:
#0: ffffffffa5173360 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xa0
stack backtrace:
CPU: 1 PID: 260 Comm: test_progs Tainted: G O 5.14.0-rc2+ #176
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x7b
bpf_get_current_cgroup_id+0x9c/0xb1
bpf_prog_a29888d1c6706e09_test_sys_setdomainname+0x3e/0x89c
bpf_trampoline_6442469132_0+0x2d/0x1000
__x64_sys_setdomainname+0x5/0x110
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
I can get similar warning using bpf_get_current_ancestor_cgroup_id() helper.
syzbot reported a similar issue in [1] for syscall program. Helper
bpf_get_current_cgroup_id() or bpf_get_current_ancestor_cgroup_id()
has the following callchain:
task_dfl_cgroup
task_css_set
task_css_set_check
and we have
#define task_css_set_check(task, __c) \
rcu_dereference_check((task)->cgroups, \
lockdep_is_held(&cgroup_mutex) || \
lockdep_is_held(&css_set_lock) || \
((task)->flags & PF_EXITING) || (__c))
Since cgroup_mutex/css_set_lock is not held and the task
is not existing and rcu read_lock is not held, a warning
will be issued. Note that bpf sleepable program is protected by
rcu_read_lock_trace().
The above sleepable bpf programs are already protected
by migrate_disable(). Adding rcu_read_lock() in these
two helpers will silence the above warning.
I marked the patch fixing 95b861a7935b
("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
which added bpf_get_current_ancestor_cgroup_id() to tracing programs
in 5.14. I think backporting 5.14 is probably good enough as sleepable
progrems are not widely used.
This patch should fix [1] as well since syscall program is a sleepable
program protected with migrate_disable().
[1] https://lore.kernel.org/bpf/0000000000006d5cab05c7d9bb87@xxxxxxxxxx/
Reported-by: syzbot+7ee5c2c09c284495371f@xxxxxxxxxxxxxxxxxxxxxxxxx
Fixes: 95b861a7935b ("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
Signed-off-by: Yonghong Song <yhs@xxxxxx>
---
kernel/bpf/helpers.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 62cf00383910..4567d2841133 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -353,7 +353,11 @@ const struct bpf_func_proto bpf_jiffies64_proto = {
#ifdef CONFIG_CGROUPS
BPF_CALL_0(bpf_get_current_cgroup_id)
{
- struct cgroup *cgrp = task_dfl_cgroup(current);
+ struct cgroup *cgrp;
+
+ rcu_read_lock();
+ cgrp = task_dfl_cgroup(current);
+ rcu_read_unlock();
return cgroup_id(cgrp);
I'm a bit confused, if cgroup object relies rcu_read_lock() and not rcu_read_lock_trace()
context, then the above is racy given you access the memory via cgroup_id() outside of it,
same below. If rcu_read_lock_trace() is enough and the above is really just to silence the
'suspicious rcu_dereference_check() usage' splat, then the rcu_dereference_check() from
task_css_set_check() should be extended to check for _trace() flavor instead [meaning, as
a cleaner workaround], which one is it?
Thanks for pointing out. Definitely a bug on my side. rcu_read_lock()
region should include cgroup_id(cgrp) as well.
At first glance, something like this would work from an RCU perspective:
BPF_CALL_0(bpf_get_current_cgroup_id)
{
struct cgroup *cgrp = task_dfl_cgroup(current);
struct cgroup *cgrp;
u64 ret;
rcu_read_lock();
cgrp = task_dfl_cgroup(current);
ret = cgroup_id(cgrp);
rcu_read_unlock();
return ret;
}
If I am reading the code correctly, rcu_read_lock() is required to keep
the cgroup from going away, and the rcu_read_lock_trace() is needed in
addition in order to keep the BPF program from going away.
There might well be better ways to do this, but the above obeys the
letter of the law. Or at least the law as I understand it. ;-)
Or does the fact that a cgroup contains a given task prevent that cgroup
from going away? (I -think- that tasks can be removed from cgroups
without the cooperation of those tasks, but then again there is much
about cgroups that I do not know.)
Thanx, Paul
}
@@ -366,9 +370,13 @@ const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
BPF_CALL_1(bpf_get_current_ancestor_cgroup_id, int, ancestor_level)
{
- struct cgroup *cgrp = task_dfl_cgroup(current);
+ struct cgroup *cgrp;
struct cgroup *ancestor;
+ rcu_read_lock();
+ cgrp = task_dfl_cgroup(current);
+ rcu_read_unlock();
+
ancestor = cgroup_ancestor(cgrp, ancestor_level);
if (!ancestor)
return 0;
Thanks,
Daniel