Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread

On 8/3/24 8:15 AM, Olivier Langlois wrote:
> On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
>>>
>>> I am definitely interested in running the profiler tools that you
>>> are proposing... Most of my problems are resolved...
>>>
>>> - I got rid of 99.9% of the NET_RX_SOFTIRQ
>>> - I have significantly reduced the number of NET_TX_SOFTIRQ
>>>   https://github.com/amzn/amzn-drivers/issues/316
>>> - No more rcu context switches
>>> - CPU2 is now nohz_full all the time
>>> - The CPU1 local timer interrupt is raised once every 2-3 seconds
>>>   for an unknown reason. Paul E. McKenney did offer me his
>>>   assistance on this issue:
>>> https://lore.kernel.org/rcu/367dc07b740637f2ce0298c8f19f8aec0bdec123.camel@xxxxxxxxxxxxxx/t/#u
>>
>> And I was just going to propose asking Paul, but great to
>> see you beat me to it
>>
> My investigation has progressed... my CPU1 interrupts are nvme block
> device interrupts.
> 
> I feel that for questions about block device drivers, this time, I am
> ringing at the experts' door!
> 
> What is the meaning of an nvme interrupt?
> 
> I am assuming that this is to signal the completion of writing blocks
> to the device...
> I am currently looking in the code to find the answer to this.
> 
> Next, it seems to me that there is an odd number of interrupts for the
> device:
>  63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
>  64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
>  65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2
> 
> Why 3? Why not 4? One for each CPU...
> 
> If there were 4, I would have concluded that the driver has created a
> queue for each CPU...
> 
> How are the queues associated with a given request/task?
> 
> The file I/O is made by threads running on CPU3, so I find it
> surprising that nvme0q1 is chosen...
> 
> One noteworthy detail is that the process's main thread is on CPU1. In
> my flawed mental model of 1 queue per CPU, there could be some sort of
> magical association between a process's file descriptor table and the
> chosen block device queue, but this idea does not hold... What would
> happen to processes running on CPU2...

The cpu <-> hw queue mappings for nvme devices depend on the topology of
the machine (number of CPUs, relation between thread siblings, number of
nodes, etc) and the number of queues available on the device in question.
If you have as many (or more) device side queues available as number of
CPUs, then you'll have a queue per CPU. If you have less, then multiple
CPUs will share a queue.
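
A quick way to check which case you're in is to compare the CPU count
with the number of blk-mq queues the device was given. A minimal sketch,
assuming the device is nvme0n1 (adjust the name for your setup):

# number of online CPUs
nproc
# number of I/O queues blk-mq set up for this device
ls /sys/block/nvme0n1/mq | wc -l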

You can check the mappings in /sys/kernel/debug/block/<device>/

In there you'll find a number of hctxN folders; each of these is a
hardware queue. hctxN/type tells you what kind of queue it is, and
inside the directory, you'll find which CPUs this queue is mapped to.
Example:

root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type 
default

"default" means it's a read/write queue, so it'll handle both reads and
writes.
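
Depending on how the driver sets up its queue maps, you may also see
"read" or "poll" queue types here. To dump the type of every hardware
queue at once (same hypothetical device name as above):

# one line of output per hardware queue
cat /sys/kernel/debug/block/nvme0n1/hctx*/type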

root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
active  cpu11/   dispatch       sched_tags         tags
busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
cpu10/  ctx_map  flags          state              type

and we can see this hardware queue is mapped to cpu 10/11/266.
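
If you want the full cpu <-> queue picture in one shot, each hctxN
directory has one cpuM subdirectory per mapped CPU, so something like
this (again assuming nvme0n1) lists every mapping:

# prints e.g. .../hctx1/cpu10 .../hctx1/cpu11 .../hctx1/cpu266
ls -d /sys/kernel/debug/block/nvme0n1/hctx*/cpu*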

That ties into how these are mapped. It's pretty simple - if a task is
running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
1. This maps to the interrupts you found, but note that the admin queue
(which is not listed in these directories, as it's not an IO queue) is
the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.

If IO is queued on hctx1, then it should complete on the interrupt
vector associated with nvme0q2.
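
You can verify that end to end with standard tools, e.g. pin some
O_DIRECT reads to one of the CPUs that hctx1 covers and watch the
matching interrupt counter move. A sketch using the example mapping
above (cpu 10 -> hctx1 -> nvme0q2; your numbers will differ):

# issue direct reads from cpu 10, bypassing the page cache
taskset -c 10 dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct
# the nvme0q2 count should have gone up on the CPUs this queue maps to
grep nvme0q2 /proc/interrupts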

-- 
Jens Axboe




