On 2/27/2024 1:32 PM, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
>> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
>>> We noticed Tasks RCU being blocked when threaded NAPIs are very busy
>>> in production: detaching any BPF tracing programs, i.e. removing an
>>> ftrace trampoline, will simply block for a very long time in
>>> rcu_tasks_wait_gp. This ranges from hundreds of seconds to even an
>>> hour, severely harming any observability tools that rely on BPF
>>> tracing programs. It can be easily reproduced locally with the
>>> following setup:
>>>
>>> ip netns add test1
>>> ip netns add test2
>>>
>>> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>>>
>>> ip -n test1 link set veth1 up
>>> ip -n test1 link set lo up
>>> ip -n test2 link set veth2 up
>>> ip -n test2 link set lo up
>>>
>>> ip -n test1 addr add 192.168.1.2/31 dev veth1
>>> ip -n test1 addr add 1.1.1.1/32 dev lo
>>> ip -n test2 addr add 192.168.1.3/31 dev veth2
>>> ip -n test2 addr add 2.2.2.2/31 dev lo
>>>
>>> ip -n test1 route add default via 192.168.1.3
>>> ip -n test2 route add default via 192.168.1.2
>>>
>>> for i in `seq 10 210`; do
>>>     for j in `seq 10 210`; do
>>>         ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>>>     done
>>> done
>>>
>>> ip netns exec test2 ethtool -K veth2 gro on
>>> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
>>> ip netns exec test1 ethtool -K veth1 tso off
>>>
>>> Then running an iperf3 client/server plus a bpftrace script can
>>> trigger it:
>>>
>>> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
>>> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
>>> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>>>
>>> The above reproduces on a net-next kernel with the following RCU and
>>> preempt configurations:
>>>
>>> # RCU Subsystem
>>> CONFIG_TREE_RCU=y
>>> CONFIG_PREEMPT_RCU=y
>>> # CONFIG_RCU_EXPERT is not set
>>> CONFIG_SRCU=y
>>> CONFIG_TREE_SRCU=y
>>> CONFIG_TASKS_RCU_GENERIC=y
>>> CONFIG_TASKS_RCU=y
>>> CONFIG_TASKS_RUDE_RCU=y
>>> CONFIG_TASKS_TRACE_RCU=y
>>> CONFIG_RCU_STALL_COMMON=y
>>> CONFIG_RCU_NEED_SEGCBLIST=y
>>> # end of RCU Subsystem
>>>
>>> # RCU Debugging
>>> # CONFIG_RCU_SCALE_TEST is not set
>>> # CONFIG_RCU_TORTURE_TEST is not set
>>> # CONFIG_RCU_REF_SCALE_TEST is not set
>>> CONFIG_RCU_CPU_STALL_TIMEOUT=21
>>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>>> # CONFIG_RCU_TRACE is not set
>>> # CONFIG_RCU_EQS_DEBUG is not set
>>> # end of RCU Debugging
>>>
>>> CONFIG_PREEMPT_BUILD=y
>>> # CONFIG_PREEMPT_NONE is not set
>>> CONFIG_PREEMPT_VOLUNTARY=y
>>> # CONFIG_PREEMPT is not set
>>> CONFIG_PREEMPT_COUNT=y
>>> CONFIG_PREEMPTION=y
>>> CONFIG_PREEMPT_DYNAMIC=y
>>> CONFIG_PREEMPT_RCU=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
>>> CONFIG_PREEMPT_NOTIFIERS=y
>>> # CONFIG_DEBUG_PREEMPT is not set
>>> # CONFIG_PREEMPT_TRACER is not set
>>> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>>>
>>> An interesting observation is that, while Tasks RCU is blocked, the
>>> related NAPI thread is still being scheduled regularly (even across
>>> cores). Looking at the gp conditions, I am inclined to think the
>>> cond_resched after each __napi_poll is the problem: cond_resched
>>> enters the scheduler with the PREEMPT bit set, which does not count
>>> as a gp for Tasks RCU.
>>> Meanwhile, since the thread is being rescheduled so frequently, the
>>> normal scheduling point (no PREEMPT bit, counted as a Tasks RCU gp)
>>> seems to have very little chance to kick in. Given the "busy polling"
>>> nature of the program, such a NAPI thread won't have task->nvcsw or
>>> task->on_rq updated (the other gp conditions), so the thread is left
>>> on the RCU holdout list for an indefinitely long time.
>>>
>>> This is simply fixed by mirroring the ksoftirqd behavior: after the
>>> NAPI/softirq work, report an RCU QS to help expedite the grace
>>> period. No more blocking afterwards for the same setup.
>>>
>>> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
>>> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
>>> ---
>>>  net/core/dev.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 275fd5259a4a..6e41263ff5d3 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>>>  			net_rps_action_and_irq_enable(sd);
>>>  		}
>>>  		skb_defer_free_flush(sd);
>
> Please put a comment here stating that RCU readers cannot cross
> this point.
>
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section.  And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.

Agreed. Also, does PREEMPT_RT not have a similar issue? I noticed Thomas
had added this to softirq.c [1], but the changelog did not have details.

Also, optionally, I wonder if calling rcu_tasks_qs() directly is better
(for documentation, if anything), since the issue is Tasks RCU specific.

A code comment above the rcu_softirq_qs() call about cond_resched() not
taking care of Tasks RCU would also be great!

Reviewed-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>

thanks,

 - Joel

[1]
@@ -381,8 +553,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 		pending >>= softirq_bit;
 	}

-	if (__this_cpu_read(ksoftirqd) == current)
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+	    __this_cpu_read(ksoftirqd) == current)
 		rcu_softirq_qs();
+
 	local_irq_disable();
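
As a side note for readers following the thread: the "other gp
conditions" mentioned in the changelog above are the per-task checks
that the Tasks RCU grace-period kthread applies to decide when a task
may leave the holdout list. A simplified paraphrase of why a busy
threaded-NAPI loop never passes them (hypothetical helper name, not the
verbatim kernel/rcu/tasks.h logic):

/*
 * Simplified paraphrase of the Tasks RCU holdout check (hypothetical
 * helper, not verbatim kernel code).  A task leaves the holdout list
 * once any of these conditions holds; a threaded-NAPI loop that only
 * reschedules via cond_resched() (a preemption-style switch) satisfies
 * none of them, so it delays rcu_tasks_wait_gp indefinitely.
 */
static bool napi_thread_can_exit_holdout(struct task_struct *t,
					 unsigned long snapped_nvcsw)
{
	/* A voluntary context switch happened since the gp started. */
	if (READ_ONCE(t->nvcsw) != snapped_nvcsw)
		return true;

	/* The task blocked and was dequeued from its runqueue. */
	if (!READ_ONCE(t->on_rq))
		return true;

	/* Still busy-polling: neither condition changed, keep holding. */
	return false;
}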
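
And to make the two comment requests above concrete: the added hunk in
napi_threaded_poll() (its '+' lines are not visible in the quote above)
could end up looking roughly like the following sketch, assuming the
plain rcu_softirq_qs() call is kept:

		skb_defer_free_flush(sd);

		/*
		 * The repoll loop only ever reschedules via cond_resched(),
		 * i.e. with the PREEMPT bit set, which Tasks RCU does not
		 * treat as a quiescent state.  Report a QS explicitly, as
		 * ksoftirqd does.  Note that RCU readers cannot cross this
		 * point.
		 */
		rcu_softirq_qs();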