On 8/3/24 8:15 AM, Olivier Langlois wrote:
> On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
>>>
>>> I am definitely interested in running the profiler tools that you
>>> are proposing... Most of my problems are resolved...
>>>
>>> - I got rid of 99.9% of the NET_RX_SOFTIRQ
>>> - I have reduced significantly the number of NET_TX_SOFTIRQ
>>>   https://github.com/amzn/amzn-drivers/issues/316
>>> - No more rcu context switches
>>> - CPU2 is now nohz_full all the time
>>> - CPU1 local timer interrupt is raised once every 2-3 seconds for an
>>>   unknown origin. Paul E. McKenney did offer me his assistance on this
>>>   issue
>>>   https://lore.kernel.org/rcu/367dc07b740637f2ce0298c8f19f8aec0bdec123.camel@xxxxxxxxxxxxxx/t/#u
>>
>> And I was just going to propose to ask Paul, but great to
>> see you beat me on that
>>
> My investigation has progressed... my cpu1 interrupts are nvme block
> device interrupts.
>
> I feel that for questions about block device drivers, this time, I am
> ringing at the experts' door!
>
> What is the meaning of an nvme interrupt?
>
> I am assuming that it signals the completion of writing blocks to the
> device... I am currently looking in the code to find the answer.
>
> Next, it seems to me that there is an odd number of interrupts for the
> device:
>
>  63:   12       0       0       0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
>  64:    0   23336       0       0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
>  65:    0       0       0   33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2
>
> Why 3? Why not 4, one for each CPU?
>
> If there were 4, I would have concluded that the driver creates a
> queue for each CPU...
>
> How are the queues associated with a given request/task?
>
> The file I/O is done by threads running on CPU3, so I find it
> surprising that nvme0q1 is chosen...
>
> One noteworthy detail is that the process main thread is on CPU1. In my
> flawed mental model of 1 queue per CPU, there could be some sort of
> magical association between a process's file descriptor table and the
> chosen block device queue, but this idea does not hold... What would
> happen to processes running on CPU2?

The cpu <-> hw queue mappings for nvme devices depend on the topology of
the machine (number of CPUs, relation between thread siblings, number of
nodes, etc) and the number of queues available on the device in question.
If you have as many (or more) device side queues available as CPUs, then
you'll have a queue per CPU. If you have fewer, then multiple CPUs will
share a queue.

You can check the mappings in /sys/kernel/debug/block/<device>/. In there
you'll find a number of hctxN folders; each of these is a hardware queue.
hctxN/type tells you what kind of queue it is, and inside the directory
you'll find which CPUs this queue is mapped to. Example:

root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
default

"default" means it's a read/write queue, so it'll handle both reads and
writes.

root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
active  cpu11/   dispatch       sched_tags         tags
busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
cpu10/  ctx_map  flags          state              type

and we can see this hardware queue is mapped to cpu 10/11/266.

That ties into how these are mapped. It's pretty simple - if a task is
running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue 1.
This maps to the interrupts you found, but note that the admin queue
(which is not listed in these directories, as it's not an IO queue) is the
first one there. hctx0 is nvme0q1 in your /proc/interrupts list.
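If you want to see the whole map at a glance instead of poking at one hctx
at a time, a small loop over debugfs should do it (a rough sketch, assuming
the device is nvme0n1, debugfs is mounted at /sys/kernel/debug, and you are
running as root):

for h in /sys/kernel/debug/block/nvme0n1/hctx*; do
        # queue name, queue type, and the CPUs this queue is mapped to
        echo "$(basename $h): type=$(cat $h/type) cpus=$(cd $h && echo cpu*)"
done

That prints one line per hardware queue with the CPUs it serves, which you
can then line up against the nvme0qN entries in /proc/interrupts, keeping
the admin-queue offset above in mind.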
If IO is queued on hctx1, then it should complete on the interrupt vector
associated with nvme0q2.

-- 
Jens Axboe
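A quick way to confirm that mapping empirically (a sketch, assuming the
device is /dev/nvme0n1 and the CPU layout from the /proc/interrupts snippet
above; taskset plus dd with direct IO is just one way to generate IO pinned
to a CPU):

root@host# grep nvme0 /proc/interrupts
root@host# taskset -c 3 dd if=/dev/nvme0n1 of=/dev/null bs=4k count=100000 iflag=direct
root@host# grep nvme0 /proc/interrupts

Whichever nvme0qN counter grows between the two snapshots is the vector
backing the hctx that CPU3 maps to.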