Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread

On 7/31/24 01:33, Pavel Begunkov wrote:
On 7/31/24 00:14, Olivier Langlois wrote:
On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:

Removing an entry or two once every minute is definitely not
going to take 50% CPU. The RCU machinery is running in the background
regardless of whether io_uring uses it or not, and it's pretty
cheap considering amortisation.

If anything, from your explanation it sounds more like the
scheduler makes a wrong decision and schedules out the sqpoll
thread even though it could continue to run, but that needs
confirmation. Does the CPU your SQPOLL is pinned to stay
100% utilised?

Here are the facts as they are documented in the github issue.

1. despite thinking that I was doing NAPI busy polling, I was not
because my ring was not receiving any sqe after its initial setup.

This is what the patch developed with your input
https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@xxxxxxxxxxxxxx/T/#u

is addressing.

(BTW, I should check if there is such a thing, but I would love to know
whether the net code exposes a tracepoint when napi_busy_poll is called,
because it is very tricky to know whether it actually happens or not)

2. the moment a second ring that was receiving a lot of sqe was attached
to the sqpoll thread, the NAPI busy loop started to run for real and the
sqpoll CPU usage inexplicably dropped from 99% to 55%

3. here is my kernel cmdline:
hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
intel_idle.max_cstate=1 nmi_watchdog=0

there is absolutely nothing else on CPU0 where the sqpoll thread
affinity is set to run.
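(A sketch of how to double-check that the isolation actually took effect; the "iou-sqp" pgrep pattern for the sqpoll thread is an assumption and may need adjusting:)

```shell
# Verify the kernel honoured the isolation parameters from the cmdline.
cat /sys/devices/system/cpu/isolated 2>/dev/null   # expect: 0-2
cat /sys/devices/system/cpu/nohz_full 2>/dev/null  # expect: 0-2

# Check which CPUs the sqpoll thread is allowed to run on
# ("iou-sqp" is the usual comm prefix for sqpoll threads).
PID=$(pgrep -f iou-sqp | head -n1)
[ -n "$PID" ] && taskset -pc "$PID"
```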

4. I got the idea of doing this:
echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable

and I recorded over 1,000 context switches in 23 seconds to RCU-related
kernel threads.
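(For anyone reproducing this, the full sequence looks roughly like the following; the pgrep pattern and the 5-second observation window are placeholders, and root is required:)

```shell
TRACEFS=/sys/kernel/tracing
SQPOLL_PID=$(pgrep -f iou-sqp | head -n1)   # placeholder; use the real sqpoll PID

if [ -n "$SQPOLL_PID" ] && [ -w "$TRACEFS/events/sched/sched_switch/enable" ]; then
    echo "common_pid == $SQPOLL_PID" > "$TRACEFS/events/sched/sched_switch/filter"
    echo 1 > "$TRACEFS/events/sched/sched_switch/enable"
    sleep 5                                  # observation window
    echo 0 > "$TRACEFS/events/sched/sched_switch/enable"

    # Count the switches and see which threads the sqpoll task switched to
    # (RCU kthreads like rcu_preempt / rcuog / rcuop show up in next_comm=).
    grep -c sched_switch "$TRACEFS/trace" 2>/dev/null
    grep -o 'next_comm=[^ ]*' "$TRACEFS/trace" 2>/dev/null | sort | uniq -c | sort -rn
fi
```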

5. just for the fun of it, I disabled NAPI polling on my io_uring rings
and the sqpoll thread magically returned to 99% CPU usage from 55%...

I am open to other explanations for what I have observed but my current
conclusion is based on what I am able to see... the evidence appears
very convincing to me...

You're seeing something that doesn't make much sense to me, and we need
to understand what that is. There might be a bug _somewhere_, that's
always a possibility, but before saying that let's get a bit more data.

While the app is working, can you grab a profile and run mpstat for the
CPU on which you have the SQPOLL task?

perf record -g -C <CPU number> --all-kernel &
mpstat -u -P <CPU number> 5 10 &

And then, as usual, time it so that you have some activity going on;
the mpstat interval may need adjusting. Run perf report as before.

I forgot to add, ~50 switches/second for relatively brief RCU handling
is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
still running but the napi busy polling time got accounted to softirq
because of disabled bh, and you didn't include it, hence asking for CPU
stats. Do you see any latency problems with that configuration?
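(To make that concrete: CPU time spent busy polling with bh disabled lands in mpstat's %soft column rather than %usr/%sys, so something like the sketch below, assuming sysstat is installed and the sqpoll thread is pinned to CPU 0, would show it:)

```shell
# If NAPI polling time is accounted to softirq, %soft will be high on the
# SQPOLL CPU while %usr/%sys drop, even though the CPU isn't actually idle.
CPU=0   # assumed SQPOLL CPU, per the isolcpus=0,1,2 setup above
command -v mpstat >/dev/null 2>&1 && mpstat -u -P "$CPU" 1 3
```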

--
Pavel Begunkov



