On 18/06/2024 09.13, Sebastian Andrzej Siewior wrote:
The XDP redirect process is two-staged:

- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which does
not access bpf_redirect_info but will touch the individual per-CPU lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves for
protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every NAPI callback
runs in a thread. If each thread has its own data structure then locking
can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and
from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialised to
ensure nothing is leaked from the stack. This is done on first usage of
the struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. The first invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from the memset because it is only used
once the BPF_F_NEIGH flag is set, which also sets the nh member. The
kern_flags member is moved past nh to exclude it from the memset.

The pointer to bpf_net_context is saved in the task's task_struct. Always
using the bpf_net_context approach has the advantage that there are
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
Cc: Eduard Zingerman <eddyz87@xxxxxxxxx>
Cc: Hao Luo <haoluo@xxxxxxxxxx>
Cc: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
Cc: John Fastabend <john.fastabend@xxxxxxxxx>
Cc: KP Singh <kpsingh@xxxxxxxxxx>
Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>
Cc: Song Liu <song@xxxxxxxxxx>
Cc: Stanislav Fomichev <sdf@xxxxxxxxxx>
Cc: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
Cc: Yonghong Song <yonghong.song@xxxxxxxxx>
Cc: bpf@xxxxxxxxxxxxxxx
Acked-by: Alexei Starovoitov <ast@xxxxxxxxxx>
Reviewed-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
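To make the mechanism concrete, below is a minimal sketch of what the
helpers described above could look like. It is an illustration of the
description, not a quote of the patch: the struct layout, the
BPF_RI_F_RI_INIT flag and the task_struct::bpf_net_context member name
are assumptions drawn from the commit message.

	/*
	 * Sketch only: names and layout are illustrative, based on the
	 * commit message rather than the actual hunks.
	 */
	struct bpf_net_context {
		struct bpf_redirect_info ri;
	};

	static inline struct bpf_net_context *
	bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
	{
		struct task_struct *tsk = current;

		/* Nesting: only the outermost caller installs the context. */
		if (tsk->bpf_net_context != NULL)
			return NULL;
		/* Mark the embedded bpf_redirect_info as not yet initialised. */
		bpf_net_ctx->ri.kern_flags = 0;
		tsk->bpf_net_context = bpf_net_ctx;
		return bpf_net_ctx;
	}

	static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
	{
		/* Only the caller that installed the context removes it. */
		if (bpf_net_ctx)
			current->bpf_net_context = NULL;
	}

	static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
	{
		/* Assumes bpf_net_ctx_set() was invoked earlier on this task. */
		struct bpf_net_context *bpf_net_ctx = current->bpf_net_context;

		if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_RI_INIT)) {
			/*
			 * Zero everything up to ::nh. nh is only valid together
			 * with BPF_F_NEIGH and kern_flags sits behind nh, so
			 * both stay out of the memset.
			 */
			memset(&bpf_net_ctx->ri, 0,
			       offsetof(struct bpf_net_context, ri.nh));
			bpf_net_ctx->ri.kern_flags |= BPF_RI_F_RI_INIT;
		}
		return &bpf_net_ctx->ri;
	}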
Acked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
---
 include/linux/filter.h | 56 ++++++++++++++++++++++++++++++++++--------
 include/linux/sched.h  |  3 +++
 kernel/bpf/cpumap.c    |  3 +++
 kernel/bpf/devmap.c    |  9 ++++++-
 kernel/fork.c          |  1 +
 net/bpf/test_run.c     | 11 ++++++++-
 net/core/dev.c         | 26 +++++++++++++++++++-
 net/core/filter.c      | 44 +++++++++------------------------
 net/core/lwt_bpf.c     |  3 +++
 9 files changed, 111 insertions(+), 45 deletions(-)
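For illustration, a hedged sketch of how a caller such as net_rx_action()
or cpu_map_kthread_run() would bracket BPF execution with these helpers;
the function and variable names here are illustrative and not hunks from
the diff above.

	/* Illustrative caller pattern, not taken from the patch. */
	static void napi_poll_sketch(void)
	{
		struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;

		/* On-stack context; installed only if no outer caller did. */
		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

		/*
		 * ... the driver runs bpf_prog_run_xdp() and
		 * xdp_do_redirect(), both of which obtain their state via
		 * bpf_net_ctx_get_ri() ...
		 */

		xdp_do_flush();
		bpf_net_ctx_clear(bpf_net_ctx);
	}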