On Wed, Feb 15, 2023 at 11:48:37AM +0100, Kumar Kartikeya Dwivedi wrote:
> On Wed, Feb 15, 2023 at 07:58:10AM CET, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > The lifetime of certain kernel structures like 'struct cgroup' is
> > protected by RCU. Hence it's safe to dereference them directly from
> > kptr-s in bpf maps. The resulting pointer is PTR_TRUSTED and can be
> > passed to kfuncs that expect KF_TRUSTED_ARGS.
>
> But I thought PTR_TRUSTED was meant to ensure that the refcount is
> always > 0? E.g. in [0] you said that kernel code should ensure the
> refcount is held while passing a trusted pointer as a tracepoint arg.
> It's also clear from what the functions that operate on PTR_TRUSTED
> are doing: bpf_cgroup_acquire is doing css_get and not css_tryget.
> Similarly bpf_cgroup_ancestor also calls cgroup_get, which does
> css_get instead of css_tryget.
>
> [0]: https://lore.kernel.org/bpf/CAADnVQJfj9mrFZ+mBfwh8Xba333B6EyHRMdb6DE4s6te_5_V_A@xxxxxxxxxxxxxx

In that link I was saying:
"
But imagine there is only an RCU protected pointer that is passed into
a tracepoint somewhere. The verifier doesn't recognize refcnt++ on it,
so ref_obj_id == 0, and the kernel code doesn't do refcnt++ either.
But it's still safe and this arg should still be
PTR_TO_BTF_ID | PTR_TRUSTED. The bpf prog can pass it further into a
kfunc that has KF_TRUSTED_ARGS. Since RCU is held before calling into
the tracepoint, the bpf prog has to be non-sleepable. An additional
rcu_read_lock done by the prog is redundant, but doesn't hurt.
When the prog calls the kfunc the pointer is still valid and the kfunc
can safely operate on it, assuming that the object is not going away.
That is the definition of KF_TRUSTED_ARGS from the pov of the kfunc.
"

I'm still saying the same. But you have a good point about
bpf_cgroup_acquire(). It needs to do cgroup_tryget(). In general
atomic_inc is only a tiny bit faster than atomic_inc_not_zero(), so we
can convert all KF_TRUSTED_ARGS kfuncs to deal with RCU protected
pointers whose refcnt could have reached zero (a sketch of the
bpf_cgroup_acquire() conversion is below).

imo kptr_rcu closes a usability gap we have with kptrs. Accessing the
same kernel pointer from multiple cpus is just not feasible with
kptr_xchg, and inventing refcnt mechanisms that do ++ on all cpus
would be too slow. RCU is perfect for the case of lots of concurrent
reads with few updates.

I think we can remove KF_RCU as well.

> And if we want to do RCU + css_tryget, we already have that in the
> form of kptr_get.

With kptr_rcu, bpf_kptr_get() will become explicit in bpf progs.
Instead of bpf_cgroup_kptr_get() the prog can do:

bpf_rcu_read_lock(); // needed only if the prog is sleepable
cgrp = val->cgrp;
if (cgrp) {
        /* may return NULL once it switches to cgroup_tryget() */
        cgrp_refcnted = bpf_cgroup_acquire(cgrp);
        if (cgrp_refcnted) {
                /* use it; bpf_cgroup_release() it later */
        }
}
bpf_rcu_read_unlock(); // needed only if the prog is sleepable

The fewer special kfuncs the better.

> I think we've had a similar discussion about this in
> https://lore.kernel.org/bpf/20220216214405.tn7thpnvqkxuvwhd@xxxxxxxxxxxxxxxxxxxxxxxxxxxx,
> where you advised against directly assuming pointers to RCU protected
> objects as trusted where the refcount could drop to 0. So then we
> went the kptr_get route, because explicit RCU sections weren't
> available back then to load + inc_not_zero directly (for sleepable
> programs).

yep. bpf_rcu_read_lock() wasn't available at that time and it felt
that the mechanism should work the same way in sleepable and
non-sleepable bpf progs, but sleepable progs have an OOM issue if they
hold on to any kernel object. So it's better to enforce in the
verifier that both sleepable and non-sleepable progs access kernel
objects under RCU. We don't have to do it for all kptrs, but for some
it's probably necessary.
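To make that conversion concrete, here is a rough sketch of what a
tryget-based bpf_cgroup_acquire() could look like. Sketch only, not
the actual patch; cgroup_tryget() is the existing helper in
include/linux/cgroup.h that does css_tryget(&cgrp->self):

/* kernel/bpf/helpers.c, sketch only */
struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp)
{
        /* Bump the refcnt only if it hasn't already dropped to zero,
         * so a kptr_rcu pointer to a dying cgroup yields NULL instead
         * of resurrecting the object.
         */
        return cgroup_tryget(cgrp) ? cgrp : NULL;
}

That also means bpf_cgroup_acquire() would have to be marked
KF_RET_NULL, and progs have to check the result, as in the snippet
above.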
> > Dereference of other kptr-s returns PTR_UNTRUSTED.
> >
> > For example:
> > struct map_value {
> >         struct cgroup __kptr_rcu *cgrp;
> > };
> >
> > SEC("tp_btf/cgroup_mkdir")
> > int BPF_PROG(test_cgrp_get_ancestors, struct cgroup *cgrp_arg, const char *path)
> > {
> >         struct cgroup *cg, *cg2;
> >
> >         cg = bpf_cgroup_acquire(cgrp_arg); // cg is PTR_TRUSTED and ref_obj_id > 0
> >         bpf_kptr_xchg(&v->cgrp, cg);
> >
> >         cg2 = v->cgrp; // cg2 is PTR_TRUSTED | MEM_RCU. This is a new feature introduced by this patch.
> >
> >         bpf_cgroup_ancestor(cg2, level); // safe to do. cg2 will not disappear
>                             ^^ But its percpu_ref can drop to zero,
> right?

Correct. bpf_cgroup_ancestor() also needs to do cgroup_tryget().
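Roughly the same tryget treatment, sketched under the assumption that
bpf_cgroup_ancestor() keeps reading cgrp->ancestors[level] as it does
today, just with cgroup_tryget() instead of cgroup_get():

/* kernel/bpf/helpers.c, sketch only */
struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level)
{
        struct cgroup *ancestor;

        if (level > cgrp->level || level < 0)
                return NULL;

        /* cgrp is RCU protected here and its refcnt may already be
         * zero; reading ancestors[] is still safe, but the ancestor
         * must be tryget-ed before it escapes the RCU section.
         */
        ancestor = cgrp->ancestors[level];
        if (!cgroup_tryget(ancestor))
                return NULL;
        return ancestor;
}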