On Tue, Aug 31, 2021 at 4:11 AM Martin KaFai Lau <kafai@xxxxxx> wrote: > > On Sun, Aug 29, 2021 at 11:52:03PM +0200, KP Singh wrote: > [ ... ] > > > > > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c > > > > index b305270b7a4b..7760bc4e9565 100644 > > > > --- a/kernel/bpf/bpf_local_storage.c > > > > +++ b/kernel/bpf/bpf_local_storage.c > > > > @@ -11,6 +11,8 @@ > > > > #include <net/sock.h> > > > > #include <uapi/linux/sock_diag.h> > > > > #include <uapi/linux/btf.h> > > > > +#include <linux/rcupdate_trace.h> > > > > +#include <linux/rcupdate_wait.h> > > > > > > > > #define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE) > > > > > > > > @@ -81,6 +83,22 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, > > > > return NULL; > > > > } > > > > > > > > +void bpf_local_storage_free_rcu(struct rcu_head *rcu) > > > > +{ > > > > + struct bpf_local_storage *local_storage; > > > > + > > > > + local_storage = container_of(rcu, struct bpf_local_storage, rcu); > > > > + kfree_rcu(local_storage, rcu); > > > > +} > > > > + > > > > +static void bpf_selem_free_rcu(struct rcu_head *rcu) > > > > +{ > > > > + struct bpf_local_storage_elem *selem; > > > > + > > > > + selem = container_of(rcu, struct bpf_local_storage_elem, rcu); > > > > + kfree_rcu(selem, rcu); > > > > +} > > > > + > > > > /* local_storage->lock must be held and selem->local_storage == local_storage. > > > > * The caller must ensure selem->smap is still valid to be > > > > * dereferenced for its smap->elem_size and smap->cache_idx. > > > > @@ -118,12 +136,12 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage, > > > > * > > > > * Although the unlock will be done under > > > > * rcu_read_lock(), it is more intutivie to > > > > - * read if kfree_rcu(local_storage, rcu) is done > > > > + * read if the freeing of the storage is done > > > > * after the raw_spin_unlock_bh(&local_storage->lock). > > > > * > > > > * Hence, a "bool free_local_storage" is returned > > > > - * to the caller which then calls the kfree_rcu() > > > > - * after unlock. > > > > + * to the caller which then calls then frees the storage after > > > > + * all the RCU grace periods have expired. > > > > */ > > > > } > > > > hlist_del_init_rcu(&selem->snode); > > > > @@ -131,7 +149,7 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage, > > > > SDATA(selem)) > > > > RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL); > > > > > > > > - kfree_rcu(selem, rcu); > > > > + call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu); > > > Although the common use case is usually storage_get() much more often > > > than storage_delete(), do you aware any performance impact for > > > the bpf prog that does a lot of storage_delete()? > > > > I have not really measured the impact on deletes, My understanding is > > that it should > > not impact the BPF program, but yes, if there are some critical > > sections that are prolonged > > due to a sleepable program "sleeping" too long, then it would pile up > > the callbacks. > > > > But this is not something new, as we have a similar thing in BPF > > trampolines. If this really > > becomes an issue, we could add a flag BPF_F_SLEEPABLE_STORAGE and only maps > > with this flag would be allowed in sleepable progs. > Agree that is similar to trampoline updates but not sure it is comparable > in terms of the frequency of elems being deleted here. e.g. many > short lived tcp connections created by external traffic. > > Adding a BPF_F_SLEEPABLE_STORAGE later won't work. It will break > existing sleepable bpf prog. > > I don't know enough on call_rcu_tasks_trace() here, so the > earlier question on perf/callback-pile-up implications in order to > decide if extra logic or knob is needed here or not. I will defer to the others, maybe Alexei and Paul, we could also just add the flag to not affect existing performance characteristics? > > > > > We could then wait for both critical sections only when this flag is > > set on the map. > > > > > > > > > > > > > return free_local_storage; > > > > } > > > > @@ -154,7 +172,8 @@ static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem) > > > > raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > > > > > > if (free_local_storage) > > > > - kfree_rcu(local_storage, rcu); > > > > + call_rcu_tasks_trace(&local_storage->rcu, > > > > + bpf_local_storage_free_rcu); > > > > } > > > > > > > > void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage, > > > > @@ -213,7 +232,8 @@ bpf_local_storage_lookup(struct bpf_local_storage *local_storage, > > > > struct bpf_local_storage_elem *selem; > > > > > > > > /* Fast path (cache hit) */ > > > > - sdata = rcu_dereference(local_storage->cache[smap->cache_idx]); > > > > + sdata = rcu_dereference_protected(local_storage->cache[smap->cache_idx], > > > > + bpf_local_storage_rcu_lock_held()); > > > There are other places using rcu_dereference() also. > > > e.g. in bpf_local_storage_update(). > > > Should they be changed also? > > > > From what I saw, the other usage of rcu_derference is in a nested > > (w.r.t to the RCU section that in bpf_prog_enter/exit) RCU > > read side critical section/rcu_read_{lock, unlock} so it should not be required. > hmm... not sure what nested or not has to do here. > It is likely we are talking different things. > Yeah, we were looking at different things. e.g. bpf_selem_unlink does not need to be changed as it is in a rcu_read_lock. But you are right there is another in bpf_local_storage_update which I will fix. > Did you turn on the lockdep rcu debugging in kconfig? Yes I have PROVE_RCU and LOCKDEP > > afaik, lockdep uses the second "check" argument to WARN on incorrect usage. > Even the right check is passed here, the later rcu_dereference() will still > complain because it only checks for rcu_read_lock_held(). > > Also, after another look, why rcu_dereference_protected() is used > here instead of rcu_dereference_check()? The spinlock is not acquired > here. The same comment goes for similar rcu_dereference_protected() usages. Good catch, it should be rcu_dereference_check. > > > > > If there are some that are not, then they need to be updated. Did I miss any? > > > > > > > > [ ... ] > > > > > > > --- a/net/core/bpf_sk_storage.c > > > > +++ b/net/core/bpf_sk_storage.c > > > > @@ -13,6 +13,7 @@ > > > > #include <net/sock.h> > > > > #include <uapi/linux/sock_diag.h> > > > > #include <uapi/linux/btf.h> > > > > +#include <linux/rcupdate_trace.h> > > > > > > > > DEFINE_BPF_STORAGE_CACHE(sk_cache); > > > > > > > > @@ -22,7 +23,8 @@ bpf_sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit) > > > > struct bpf_local_storage *sk_storage; > > > > struct bpf_local_storage_map *smap; > > > > > > > > - sk_storage = rcu_dereference(sk->sk_bpf_storage); > > > > + sk_storage = rcu_dereference_protected(sk->sk_bpf_storage, > > > > + bpf_local_storage_rcu_lock_held()); > > > > if (!sk_storage) > > > > return NULL; > > > > > > > > @@ -258,6 +260,7 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk, > > > > { > > > > struct bpf_local_storage_data *sdata; > > > > > > > > + WARN_ON_ONCE(!bpf_local_storage_rcu_lock_held()); > > > > if (!sk || !sk_fullsock(sk) || flags > BPF_SK_STORAGE_GET_F_CREATE) > > > sk is protected by rcu_read_lock here. > > > Is it always safe to access it with the rcu_read_lock_trace alone ? > > > > We don't dereference sk with an rcu_dereference though, is it still the case for > > tracing and LSM programs? Or is it somehow implicity protected even > > though we don't use rcu_dereference since that's just a READ_ONCE + some checks? > e.g. the bpf_prog (currently run under rcu_read_lock()) may read the sk from > req_sk->sk which I don't think the verifier will optimize it out, so as good > as READ_ONCE(), iiuc. > > The sk here is obtained from the bpf_lsm_socket_* hooks? Those sk should have > a refcnt, right? If that is the case, it should be good enough for now. The one passed in the arguments yes, but if you notice the discussion in https://lore.kernel.org/bpf/20210826133913.627361-1-memxor@xxxxxxxxx/T/#me254212a125516a6c5d2fbf349b97c199e66dce0 one may also get an sk in LSM and tracing progs by pointer walking. > > > > > Would grabbing a refcount earlier in the code be helpful? > Grab the refcount earlier in the bpf prog itself? > right, it may be what eventually needs to be done > if somehow there is a usecase that the sleepable bpf prog can get a > hold on a rcu protected sk.