Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protections reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved task's
> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> variable (which is always NODE-local memory). Using always the
> bpf_net_context approach has the advantage that there is almost zero
> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
> Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Cc: Eduard Zingerman <eddyz87@xxxxxxxxx>
> Cc: Hao Luo <haoluo@xxxxxxxxxx>
> Cc: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
> Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> Cc: KP Singh <kpsingh@xxxxxxxxxx>
> Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>
> Cc: Song Liu <song@xxxxxxxxxx>
> Cc: Stanislav Fomichev <sdf@xxxxxxxxxx>
> Cc: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> Cc: Yonghong Song <yonghong.song@xxxxxxxxx>
> Cc: bpf@xxxxxxxxxxxxxxx
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>

[...]

> @@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>  			       int xdp_n, struct xdp_cpumap_stats *stats,
>  			       struct list_head *list)
>  {
> +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>  	int nframes;

I think we need to zero-initialise all the context objects we allocate
on the stack. The reason being that an XDP program can return
XDP_REDIRECT without calling any of the redirect helpers first, which
will lead to xdp_do_redirect() being called without any of the fields
in struct bpf_redirect_info having been set. This can lead to a crash
if those values happen to be wrong; and if we're not initialising the
stack space used by this struct, we have no guarantees about what
values they will end up with.
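Something like this should do it (just a sketch, reusing the names from
the hunk above; whether an empty initialiser or an explicit memset() is
nicer is up to you):

	struct bpf_net_context __bpf_net_ctx = {}, *bpf_net_ctx;

or, equivalently:

	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;

	memset(&__bpf_net_ctx, 0, sizeof(__bpf_net_ctx));

Either way the embedded struct bpf_redirect_info starts out all-zeroes
instead of whatever happens to be on the stack.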
We fixed a similar bug relatively recently, see:
5bcf0dcbf906 ("xdp: use flags field to disambiguate broadcast redirect")

>  void bpf_clear_redirect_map(struct bpf_map *map)
>  {
> -	struct bpf_redirect_info *ri;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu) {
> -		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> -		/* Avoid polluting remote cacheline due to writes if
> -		 * not needed. Once we pass this test, we need the
> -		 * cmpxchg() to make sure it hasn't been changed in
> -		 * the meantime by remote CPU.
> -		 */
> -		if (unlikely(READ_ONCE(ri->map) == map))
> -			cmpxchg(&ri->map, map, NULL);
> -	}
> +	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
> +	 * program/ during NAPI callback. It is used during
> +	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
> +	 * redirect callback afterwards. ri->map is cleared after usage.
> +	 * The path has no explicit RCU read section but the local_bh_disable()
> +	 * is also a RCU read section which makes the complete softirq callback
> +	 * RCU protected. This in turn makes ri->map RCU protected and it is
> +	 * sufficient to wait a grace period to ensure that no "ri->map == map"
> +	 * exists. dev_map_free() removes the map from the list and then
> +	 * invokes synchronize_rcu() after calling this function.
> +	 */
>  }

With the zeroing of the stack variable mentioned above, I agree that
this is not needed anymore, but I think we should just get rid of the
function entirely and put a comment in devmap.c instead of the call to
the (now empty) function. Something like the sketch below, perhaps.

-Toke
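(Rough sketch only; unrelated parts of dev_map_free() are elided and the
comment wording is just an example, not taken from the patch:)

	static void dev_map_free(struct bpf_map *map)
	{
		...
		/* bpf_redirect_info now lives in a bpf_net_context on the
		 * stack of the NAPI/softirq user, so there is no per-CPU
		 * ri->map left to clear. Readers run with bottom halves
		 * disabled, which is also an RCU read section, so the
		 * synchronize_rcu() below is still what guarantees that
		 * nobody can observe ri->map == map after this point.
		 */
		synchronize_rcu();
		...
	}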