Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread

On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
> > 
> > I am definitely interested in running the profiler tools that you
> > are proposing... Most of my problems are resolved...
> > 
> > - I got rid of 99.9% of the NET_RX_SOFTIRQs
> > - I have significantly reduced the number of NET_TX_SOFTIRQs
> >    https://github.com/amzn/amzn-drivers/issues/316
> > - No more rcu context switches
> > - CPU2 is now nohz_full all the time
> > - CPU1's local timer interrupt is raised once every 2-3 seconds
> >   from an unknown origin. Paul E. McKenney offered me his
> >   assistance on this issue:
> >   https://lore.kernel.org/rcu/367dc07b740637f2ce0298c8f19f8aec0bdec123.camel@xxxxxxxxxxxxxx/t/#u
> 
> And I was just going to propose asking Paul, but great to
> see you beat me to it
> 
My investigation has progressed... my CPU1 interrupts are NVMe block
device interrupts.

I feel that for questions about block device drivers, this time, I am
ringing at the experts' door!

What is the meaning of an NVMe interrupt?

I am assuming that it signals the completion of writing blocks to the
device... I am currently looking in the driver code to confirm this.
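One way I could check that assumption is to snapshot the per-queue counters before and after forcing a write to the device, and diff them. A small shell sketch (it assumes the usual /proc/interrupts layout: per-CPU counts first, then the interrupt chip name):

```shell
# Diff two /proc/interrupts snapshots and print how much each nvme
# queue's total count advanced. On a live system the snapshots would
# be captured around a forced write, e.g.:
#   grep nvme /proc/interrupts > before
#   dd if=/dev/zero of=testfile bs=4k count=256 oflag=direct conv=fsync
#   grep nvme /proc/interrupts > after
irq_delta() {
    awk 'FNR == NR {
             n = 0
             for (i = 2; i <= NF; i++) {
                 if ($i !~ /^[0-9]+$/) break   # counts end at the chip name
                 n += $i
             }
             before[$NF] = n; next
         }
         {
             n = 0
             for (i = 2; i <= NF; i++) {
                 if ($i !~ /^[0-9]+$/) break
                 n += $i
             }
             printf "%s +%d\n", $NF, n - before[$NF]
         }' "$1" "$2"
}

# Demo with a made-up "after" snapshot:
cat > /tmp/irq.before <<'EOF'
 64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge      nvme0q1
EOF
cat > /tmp/irq.after <<'EOF'
 64:          0      23400          0          0  PCI-MSIX-0000:00:04.0   1-edge      nvme0q1
EOF
irq_delta /tmp/irq.before /tmp/irq.after
# nvme0q1 +64
```

If completions are what raise the interrupt, the counter of exactly one queue should advance roughly in step with the number of I/Os issued.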

Next, it seems to me that the device has an odd number of interrupt
vectors:
 63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge      nvme0q0
 64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge      nvme0q1
 65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge      nvme0q2
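Summarizing that table per queue makes the pinning visible. A throwaway helper (a sketch, assuming the column layout above: per-CPU counts first, then the chip name):

```shell
# For each nvme line of a /proc/interrupts snapshot, report the CPU
# that received the most interrupts for that vector (on my box each
# vector appears pinned to a single CPU, so the max tells the story).
busiest_cpu() {
    grep nvme "$1" | awk '{
        best = -1; cpu = -1
        for (i = 2; i <= NF; i++) {
            if ($i !~ /^[0-9]+$/) break      # counts end at the chip name
            if ($i + 0 > best) { best = $i + 0; cpu = i - 2 }
        }
        printf "%s -> CPU%d (%d interrupts)\n", $NF, cpu, best
    }'
}

# Demo with the (unwrapped) table from this mail:
cat > /tmp/irq.sample <<'EOF'
 63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge      nvme0q0
 64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge      nvme0q1
 65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge      nvme0q2
EOF
busiest_cpu /tmp/irq.sample
# nvme0q0 -> CPU0 (12 interrupts)
# nvme0q1 -> CPU1 (23336 interrupts)
# nvme0q2 -> CPU3 (33878 interrupts)
```

So nvme0q1 fires on CPU1 and nvme0q2 on CPU3, which is the pattern I am trying to explain.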

Why 3? Why not 4, one for each CPU?

If there were 4, I would have concluded that the driver creates one
queue per CPU...

How are the queues associated with a given request/task?

The file I/O is done by threads running on CPU3, so I find it
surprising that nvme0q1 is chosen...
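My working guess (an assumption, to be confirmed) is that blk-mq picks the hardware queue from the CPU the submission happens on, and that the CPU-to-queue map is visible under /sys/block/<dev>/mq/<n>/cpu_list. A sketch that prints such a map; since I cannot assume an NVMe device is present here, it builds a fake sysfs-like tree with the 4-CPU / 2-I/O-queue layout I would expect (the mapping values are hypothetical):

```shell
# On a real system the directory would be something like
# /sys/block/nvme0n1/mq (device name is an assumption). Build a fake
# tree so the loop below is runnable anywhere:
dev=$(mktemp -d)
mkdir -p "$dev/mq/0" "$dev/mq/1"
echo "0, 1" > "$dev/mq/0/cpu_list"     # hypothetical: hw queue 0 <- CPU0,1
echo "2, 3" > "$dev/mq/1/cpu_list"     # hypothetical: hw queue 1 <- CPU2,3

# Print which CPUs each hardware queue serves:
for q in "$dev"/mq/*; do
    printf 'hw queue %s serves CPUs: %s\n' "$(basename "$q")" "$(cat "$q/cpu_list")"
done
# hw queue 0 serves CPUs: 0, 1
# hw queue 1 serves CPUs: 2, 3
```

If something like this mapping holds on my box, then which queue completes an I/O depends on the submitting CPU, not on the process's file descriptor table.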

One noteworthy detail is that the process's main thread is on CPU1. In
my flawed mental model of one queue per CPU, there could be some sort
of magical association between a process's file descriptor table and
the chosen block device queue, but this idea does not hold... what
would happen to processes running on CPU2?
