Alexei Starovoitov wrote:
> On Fri, Nov 18, 2022 at 09:08:12AM -0600, David Vernet wrote:
> > On Thu, Nov 17, 2022 at 10:04:27PM -0800, John Fastabend wrote:
> >
> > [...]
> >
> > > > > And last thing I was checking is because KF_SLEEPABLE is not set
> > > > > this should be blocked from running on sleepable progs which would
> > > > > break the call_rcu in the destructor. Maybe small nit, not sure
> > > > > its worth it but might be nice to annotate the helper description
> > > > > with a note, "will not work on sleepable progs" or something to
> > > > > that effect.
> > > >
> > > > KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
> > > > not whether the calling program can be sleepable. call_rcu() doesn't
> > > > block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
> > > > a kfunc is sleepable, non-sleepable programs are not able to call it
> > > > (and this is enforced in the verifier).
> > >
> > > OK but should these helpers be allowed in sleepable progs? I think
> > > not. What stops this, (using your helpers):
> > >
> > > cpu0                                      cpu1
> > > ----
> > > v = insert_lookup_task(task)
> > > kptr = bpf_kptr_xchg(&v->task, NULL);
> > > if (!kptr)
> > >      return 0;
> > >                                           map_delete_elem()
> > >                                             put_task()
> > >                                               rcu_call
> > > do_something_might_sleep()
> > >                                                 put_task_struct
> > >                                                   ... free
>
> the free won't happen here, because the kptr on cpu0 holds the refcnt.
> bpf side never does direct free of kptr. It only inc/dec refcnt via kfuncs.
>
> > > kptr->[free'd memory]
> > >
> > > the insert_lookup_task will bump the refcnt on the acquire on map
> > > insert. But the lookup doesn't do anything to the refcnt and the
>
> lookup from map doesn't touch kptrs in the value.
> just reading v->kptr becomes PTR_UNTRUSTED with probe_mem protection.
>
> > > map_delete_elem will delete it. We have a check for spin_lock
> > > types to stop them from being in sleepable progs. Did I miss a
> > > similar check for these?
> >
> > So, in your example above, bpf_kptr_xchg(&v->task, NULL) will atomically
> > xchg the kptr from the map, and so the map_delete_elem() call would fail
> > with (something like) -ENOENT. In general, the semantics are similar to
> > std::unique_ptr::swap() in C++.
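To make that ownership-transfer semantic concrete, below is a minimal sketch of
the cpu0 side of the diagram above, written as a BPF program. The map layout and
the section/program names are made up for illustration; bpf_task_release() is
assumed to be the release kfunc from the patch set.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* BTF type tag marking a referenced (refcounted) kptr field in a map value. */
#ifndef __kptr_ref
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))
#endif

/* Release kfunc assumed from the patch set under discussion. */
void bpf_task_release(struct task_struct *p) __ksym;

struct task_map_value {
	struct task_struct __kptr_ref *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct task_map_value);
} task_map SEC(".maps");

SEC("tp_btf/task_newtask")
int BPF_PROG(kptr_xchg_example, struct task_struct *task, u64 clone_flags)
{
	struct task_map_value *v;
	struct task_struct *kptr;
	int key = 0;

	v = bpf_map_lookup_elem(&task_map, &key);
	if (!v)
		return 0;

	/* Atomically take ownership of whatever referenced kptr is stored in
	 * the map value (possibly NULL). Once this returns non-NULL, the
	 * reference the map was holding belongs to this program, so a
	 * concurrent map_delete_elem() has nothing left to put and the task
	 * cannot be freed underneath us.
	 */
	kptr = bpf_kptr_xchg(&v->task, NULL);
	if (!kptr)
		return 0;

	/* ... kptr can be used here as a trusted, refcounted pointer ... */

	bpf_task_release(kptr);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";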
> >
> > FWIW, I think KF_KPTR_GET kfuncs are the more complex / racy kfuncs to
> > reason about. The reason is that we're passing a pointer to the map
> > value containing a kptr directly to the kfunc (with the attempt of
> > acquiring an additional reference if a kptr was already present in the
> > map) rather than doing an xchg which atomically gets us the unique
> > pointer if nobody else xchgs it in first. So with KF_KPTR_GET, someone
> > else could come along and delete the kptr from the map while the kfunc
> > is trying to acquire that additional reference. The race looks something
> > like this:
> >
> > cpu0                                      cpu1
> > ----
> > v = insert_lookup_task(task)
> > kptr = bpf_task_kptr_get(&v->task);
> >                                           map_delete_elem()
> >                                             put_task()
> >                                               rcu_call
> >                                                 put_task_struct
> >                                                   ... free
> > if (!kptr)
> >         /* In this race example, this path will be taken. */
> >         return 0;
> >
> > The difference is that here, we're not doing an atomic xchg of the kptr
> > out of the map. Instead, we're passing a pointer to the map value
> > containing the kptr directly to bpf_task_kptr_get(), which itself tries
> > to acquire an additional reference on the task to return to the program
> > as a kptr. This is still safe, however, as bpf_task_kptr_get() uses RCU
> > and refcount_inc_not_zero() in the bpf_task_kptr_get() kfunc to ensure
> > that it can't hit a UAF, and that it won't return a dying task to the
> > caller:
> >
> > /**
> >  * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
> >  * kptr acquired by this kfunc which is not subsequently stored in a map, must
> >  * be released by calling bpf_task_release().
> >  * @pp: A pointer to a task kptr on which a reference is being acquired.
> >  */
> > __used noinline
> > struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
> > {
> > 	struct task_struct *p;
> >
> > 	rcu_read_lock();
> > 	p = READ_ONCE(*pp);
> >
> > 	/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > 	 * cpu1 could remove the element from the map here, and invoke
> > 	 * put_task_struct_rcu_user(). We're in an RCU read region
> > 	 * though, so the task won't be freed until at the very
> > 	 * earliest, the rcu_read_unlock() below.
> > 	 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > 	 */
> >
> > 	if (p && !refcount_inc_not_zero(&p->rcu_users))
> > 		/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > 		 * refcount_inc_not_zero() will return false, as cpu1
> > 		 * deleted the element from the map and dropped its last
> > 		 * refcount. So we just return NULL as the task will be
> > 		 * deleted once an RCU gp has elapsed.
> > 		 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > 		 */
> > 		p = NULL;
> > 	rcu_read_unlock();
> >
> > 	return p;
> > }
> >
> > Let me know if that makes sense. This stuff is tricky, and I plan to
> > clearly / thoroughly add it to that kptr docs page once this patch set
> > lands.
>
> All correct. Probably worth adding this comment directly in bpf_task_kptr_get.

Yes, also agree, thanks for the details. Spent some time trying to break it this
evening, but didn't find anything. Thanks.
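As an aside on the KF_SLEEPABLE point at the top of the thread, here is a rough
sketch of how kfunc flags like these get attached at registration time. The flag
combinations and the init function below are assumptions for illustration, not
quoted from the patch set.

#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/init.h>
#include <linux/module.h>

BTF_SET8_START(task_kfunc_btf_ids)
/* KF_KPTR_GET: the argument is a pointer to a map-value kptr field, as in
 * bpf_task_kptr_get() above; KF_RET_NULL because the get can lose the race.
 */
BTF_ID_FLAGS(func, bpf_task_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
/* Neither carries KF_SLEEPABLE: the kfuncs themselves never block
 * (call_rcu() does not sleep). KF_SLEEPABLE marks a kfunc that may itself
 * sleep, which the verifier then only allows sleepable programs to call.
 */
BTF_SET8_END(task_kfunc_btf_ids)

static const struct btf_kfunc_id_set task_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &task_kfunc_btf_ids,
};

static int __init task_kfunc_init(void)
{
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &task_kfunc_set);
}
late_initcall(task_kfunc_init);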