On 19/03/2024 21.44, Yan Zhai wrote:
This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically:

 * poll a batch of packets
 * if there is more work, call cond_resched() to allow scheduling
 * continue to poll more packets while the rx queue is not empty

We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched() is insufficient for RCU tasks to reach
quiescent states. This also has the side effect of frequently clearing
the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result,
schedule() will not be called in these circumstances, even though
schedule() does in fact provide the required quiescent states. This
affects at least NAPI threads, napi_busy_loop, and the cpumap kthread.

By reporting RCU QSes in these kthreads periodically before
cond_resched(), the blocked RCU waiters can correctly progress. Rather
than reporting a QS for RCU tasks only, note that this code shares the
same concern as described in commit d28139c4e967 ("rcu: Apply RCU-bh
QSes to RCU-sched and RCU-preempt when safe"). So report a consolidated
QS for safety.

It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when the polling interval is set as
high as 2ms, which is far larger than the 50us-100us recommended in the
documentation. So napi_busy_loop is left untouched.

Lastly, this does not affect RT kernels, which do not enter the
scheduler through cond_resched(). Without the mentioned side effect,
schedule() will be called from time to time and clear the RCU task
holdouts.
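
For illustration only, the "report a QS periodically before
cond_resched()" pattern described above can be sketched as a userspace
simulation. This is not the code from the series: fake_jiffies,
FAKE_HZ, QS_INTERVAL, report_consolidated_qs(), qs_periodic() and
busy_poll() are all hypothetical stand-ins, and the 100 ms reporting
interval is an assumption chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
static uint64_t fake_jiffies;            /* simulated tick counter */
static unsigned int qs_reports;          /* quiescent states reported so far */

#define FAKE_HZ     1000u                /* assumed ticks per second */
#define QS_INTERVAL (FAKE_HZ / 10)       /* assumed: at most one report per 100 ms */

/* Stand-in for reporting a consolidated RCU quiescent state. */
static void report_consolidated_qs(void)
{
	qs_reports++;
}

/*
 * Report a QS only if more than QS_INTERVAL ticks have elapsed since
 * *last_qs, then remember when the report happened.
 */
static void qs_periodic(uint64_t *last_qs)
{
	assert(last_qs != NULL);

	if (fake_jiffies - *last_qs > QS_INTERVAL) {
		/* Kernel code would disable preemption around the report. */
		report_consolidated_qs();
		*last_qs = fake_jiffies;
	}
}

/*
 * Simulated busy kthread: poll `batches` batches, each costing
 * `ticks_per_batch` ticks. Returns how many QSes this call reported.
 */
static unsigned int busy_poll(unsigned int batches, unsigned int ticks_per_batch)
{
	uint64_t last_qs = fake_jiffies;
	unsigned int before = qs_reports;

	for (unsigned int i = 0; i < batches; i++) {
		fake_jiffies += ticks_per_batch; /* "poll a batch of packets" */
		qs_periodic(&last_qs);           /* report QS before yielding */
		/* cond_resched() would be called here in kernel code */
	}
	return qs_reports - before;
}
```

Rate-limiting on elapsed time rather than on batch count keeps the cost
of the report negligible for short batches while still bounding how
long an RCU waiter can be held up by a loop that never leaves the CPU.
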
V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@xxxxxxxxxxxxxx/
V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@xxxxxxxxxx/t/
V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t

changes since v4:
 * polished comments and docs for the RCU helper as Paul McKenney
   suggested

changes since v3:
 * fixed kernel-doc errors

changes since v2:
 * created a helper in rcu header to abstract the behavior
 * fixed cpumap kthread in addition

changes since v1:
 * disable preemption first as Paul McKenney suggested

Yan Zhai (3):
  rcu: add a helper to report consolidated flavor QS
  net: report RCU QS on threaded NAPI repolling
  bpf: report RCU QS in cpumap kthread

 include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c      |  3 +++
 net/core/dev.c           |  3 +++
 3 files changed, 37 insertions(+)

Acked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>