Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Wed, 28 Feb 2024 10:37:00 -0500





On 2/27/2024 1:32 PM, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
>> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
>>> We noticed task RCUs being blocked when threaded NAPIs are very busy in
>>> production: detaching any BPF tracing programs, i.e. removing a ftrace
>>> trampoline, will simply block for very long in rcu_tasks_wait_gp. This
>>> ranges from hundreds of seconds to even an hour, severely harming any
>>> observability tools that rely on BPF tracing programs. It can be
>>> easily reproduced locally with following setup:
>>>
>>> ip netns add test1
>>> ip netns add test2
>>>
>>> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>>>
>>> ip -n test1 link set veth1 up
>>> ip -n test1 link set lo up
>>> ip -n test2 link set veth2 up
>>> ip -n test2 link set lo up
>>>
>>> ip -n test1 addr add 192.168.1.2/31 dev veth1
>>> ip -n test1 addr add 1.1.1.1/32 dev lo
>>> ip -n test2 addr add 192.168.1.3/31 dev veth2
>>> ip -n test2 addr add 2.2.2.2/31 dev lo
>>>
>>> ip -n test1 route add default via 192.168.1.3
>>> ip -n test2 route add default via 192.168.1.2
>>>
>>> for i in `seq 10 210`; do
>>>  for j in `seq 10 210`; do
>>>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>>>  done
>>> done
>>>
>>> ip netns exec test2 ethtool -K veth2 gro on
>>> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
>>> ip netns exec test1 ethtool -K veth1 tso off
>>>
>>> Then run an iperf3 client/server and a bpftrace script can trigger it:
>>>
>>> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
>>> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
>>> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>>>
>>> Above reproduce for net-next kernel with following RCU and preempt
>>> configuraitons:
>>>
>>> # RCU Subsystem
>>> CONFIG_TREE_RCU=y
>>> CONFIG_PREEMPT_RCU=y
>>> # CONFIG_RCU_EXPERT is not set
>>> CONFIG_SRCU=y
>>> CONFIG_TREE_SRCU=y
>>> CONFIG_TASKS_RCU_GENERIC=y
>>> CONFIG_TASKS_RCU=y
>>> CONFIG_TASKS_RUDE_RCU=y
>>> CONFIG_TASKS_TRACE_RCU=y
>>> CONFIG_RCU_STALL_COMMON=y
>>> CONFIG_RCU_NEED_SEGCBLIST=y
>>> # end of RCU Subsystem
>>> # RCU Debugging
>>> # CONFIG_RCU_SCALE_TEST is not set
>>> # CONFIG_RCU_TORTURE_TEST is not set
>>> # CONFIG_RCU_REF_SCALE_TEST is not set
>>> CONFIG_RCU_CPU_STALL_TIMEOUT=21
>>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>>> # CONFIG_RCU_TRACE is not set
>>> # CONFIG_RCU_EQS_DEBUG is not set
>>> # end of RCU Debugging
>>>
>>> CONFIG_PREEMPT_BUILD=y
>>> # CONFIG_PREEMPT_NONE is not set
>>> CONFIG_PREEMPT_VOLUNTARY=y
>>> # CONFIG_PREEMPT is not set
>>> CONFIG_PREEMPT_COUNT=y
>>> CONFIG_PREEMPTION=y
>>> CONFIG_PREEMPT_DYNAMIC=y
>>> CONFIG_PREEMPT_RCU=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
>>> CONFIG_PREEMPT_NOTIFIERS=y
>>> # CONFIG_DEBUG_PREEMPT is not set
>>> # CONFIG_PREEMPT_TRACER is not set
>>> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>>>
>>> An interesting observation is that, while tasks RCUs are blocked,
>>> related NAPI thread is still being scheduled (even across cores)
>>> regularly. Looking at the gp conditions, I am inclining to cond_resched
>>> after each __napi_poll being the problem: cond_resched enters the
>>> scheduler with PREEMPT bit, which does not account as a gp for tasks
>>> RCUs. Meanwhile, since the thread has been frequently resched, the
>>> normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
>>> seems to have very little chance to kick in. Given the nature of "busy
>>> polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
>>> updated (other gp conditions), the result is that such NAPI thread is
>>> put on RCU holdouts list for indefinitely long time.
>>>
>>> This is simply fixed by mirroring the ksoftirqd behavior: after
>>> NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
>>> more blocking afterwards for the same setup.
>>>
>>> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
>>> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
>>> ---
>>>  net/core/dev.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 275fd5259a4a..6e41263ff5d3 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>>>                                 net_rps_action_and_irq_enable(sd);
>>>                         }
>>>                         skb_defer_free_flush(sd);
> Please put a comment here stating that RCU readers cannot cross
> this point.
> 
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section.  And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.

Agreed, also does PREEMPT_RT not have similar issue? I noticed Thomas had
added this to softirq.c [1] but the changelog did not have details.

Also optionally, I wonder if calling rcu_tasks_qs() directly is better
(for documentation if anything) since the issue is Tasks RCU specific. Also
code comment above the rcu_softirq_qs() call about cond_resched() not taking
care of Tasks RCU would be great!

Reviewed-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>

thanks,

 - Joel
[1]
@@ -381,8 +553,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
                pending >>= softirq_bit;
        }

-       if (__this_cpu_read(ksoftirqd) == current)
+       if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+           __this_cpu_read(ksoftirqd) == current)
                rcu_softirq_qs();
+
        local_irq_disable();