On Thu, Feb 29, 2024 at 9:47 PM Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
>
> We noticed tasks RCU being blocked when threaded NAPIs are very busy
> under certain workloads: detaching any BPF tracing program, i.e.
> removing its ftrace trampoline, will simply block for a very long time
> in rcu_tasks_wait_gp. This ranges from hundreds of seconds to even an
> hour, severely harming any observability tools that rely on BPF
> tracing programs. It can be easily reproduced locally with the
> following setup:
>
> ip netns add test1
> ip netns add test2
>
> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>
> ip -n test1 link set veth1 up
> ip -n test1 link set lo up
> ip -n test2 link set veth2 up
> ip -n test2 link set lo up
>
> ip -n test1 addr add 192.168.1.2/31 dev veth1
> ip -n test1 addr add 1.1.1.1/32 dev lo
> ip -n test2 addr add 192.168.1.3/31 dev veth2
> ip -n test2 addr add 2.2.2.2/31 dev lo
>
> ip -n test1 route add default via 192.168.1.3
> ip -n test2 route add default via 192.168.1.2
>
> for i in `seq 10 210`; do
>   for j in `seq 10 210`; do
>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>   done
> done
>
> ip netns exec test2 ethtool -K veth2 gro on
> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> ip netns exec test1 ethtool -K veth1 tso off
>
> Then running an iperf3 client/server plus a bpftrace script can
> trigger it:
>
> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>
> The above reproduces on a net-next kernel with the following RCU and
> preempt configurations:
>
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # end of RCU Subsystem
>
> # RCU Debugging
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> # CONFIG_RCU_TRACE is not set
> # CONFIG_RCU_EQS_DEBUG is not set
> # end of RCU Debugging
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_DYNAMIC=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_DEBUG_PREEMPT is not set
> # CONFIG_PREEMPT_TRACER is not set
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> An interesting observation is that, while tasks RCU is blocked, the
> related NAPI thread is still being scheduled regularly (even across
> cores). Looking at the gp conditions, I am inclined to think the
> cond_resched after each __napi_poll is the problem: cond_resched
> enters the scheduler with the PREEMPT bit set, which does not count as
> a quiescent state for tasks RCU. Meanwhile, since the thread is
> frequently rescheduled, the normal scheduling point (no PREEMPT bit,
> which does count as a tasks RCU quiescent state) seems to have very
> little chance to kick in. Given the nature of a "busy polling"
> program, such a NAPI thread won't have task->nvcsw or task->on_rq
> updated either (the other gp conditions), so the NAPI thread stays on
> the tasks RCU holdout list for an indefinitely long time.
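(For illustration, the failure pattern boils down to a kthread body of
the following shape. This is a minimal sketch, not the actual NAPI
code: do_work() is a hypothetical stand-in for __napi_poll() and its
surrounding bookkeeping.

	/* Sketch: a kthread loop of this shape starves tasks RCU. */
	static int busy_poll_thread(void *data)
	{
		while (!kthread_should_stop()) {
			do_work();	/* always finds more packets, never sleeps */
			/* Under PREEMPT_DYNAMIC with voluntary preemption,
			 * cond_resched() reschedules with the PREEMPT bit
			 * set, which tasks RCU does not treat as a quiescent
			 * state, so this thread remains a holdout
			 * indefinitely.
			 */
			cond_resched();
		}
		return 0;
	}
)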
>
> This is simply fixed by adopting behavior similar to ksoftirqd's:
> after the thread repolls for a while, raise an RCU QS to help expedite
> the tasks RCU grace period. No more blocking afterwards.
>
> Some brief iperf3 throughput testing in my VM with a net-next kernel
> shows no notable perf difference with 1500 byte MTU, for 10 repeat
> runs each:
>
> Before:
> UDP: 3.073Gbps (+-0.070Gbps)
> TCP: 37.850Gbps (+-1.947Gbps)
>
> After:
> UDP: 3.077Gbps (+-0.121Gbps)
> TCP: 38.120Gbps (+-2.272Gbps)
>
> Note I didn't enable GRO for UDP, so its throughput is lower than
> TCP's.
>
> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> Suggested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Reviewed-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> ---
> v1->v2: moved rcu_softirq_qs out of the bh critical section, and only
> raise it after a second of repolling. Added some brief perf test
> results.
> Link to v1: https://lore.kernel.org/netdev/Zd4DXTyCf17lcTfq@debian.debian/T/#u

And I apparently forgot to rename the subject, since the QS is no
longer raised after every poll (let me know if it is preferred to send
a V3 to fix it).

thanks
Yan

> ---
>  net/core/dev.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 275fd5259a4a..76cff3849e1f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6751,9 +6751,12 @@ static int napi_threaded_poll(void *data)
>  {
>  	struct napi_struct *napi = data;
>  	struct softnet_data *sd;
> +	unsigned long next_qs;
>  	void *have;
>
>  	while (!napi_thread_wait(napi)) {
> +		next_qs = jiffies + HZ;
> +
>  		for (;;) {
>  			bool repoll = false;
>
> @@ -6778,6 +6781,21 @@ static int napi_threaded_poll(void *data)
>  			if (!repoll)
>  				break;
>
> +			/* cond_resched cannot unblock tasks RCU writers, so
> +			 * it is necessary to relax periodically and raise a
> +			 * QS to avoid starving writers under frequent
> +			 * repolls, e.g. the ftrace trampoline clean up work.
> +			 * When not repolling, napi_thread_wait will enter
> +			 * sleep and have the same QS effect.
> +			 */
> +			if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +			    time_after(jiffies, next_qs)) {
> +				preempt_disable();
> +				rcu_softirq_qs();
> +				preempt_enable();
> +				next_qs = jiffies + HZ;
> +			}
> +
>  			cond_resched();
>  		}
>  	}
> --
> 2.30.2
>
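For reference, the fix mirrors what ksoftirqd already does: the softirq
processing loop reports a QS when running in ksoftirqd context, so busy
softirq work cannot stall tasks RCU writers either. Roughly, paraphrased
from kernel/softirq.c (the exact placement varies across kernel
versions):

	/* Inside the softirq processing loop: when running as ksoftirqd,
	 * report a quiescent state so tasks RCU can make progress.
	 */
	if (__this_cpu_read(ksoftirqd) == current)
		rcu_softirq_qs();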