Sighting: io_uring requests on unexpected core

Hey all, Mark here again with another sighting.  If you're all working from home like me during this virus period, maybe this will give you a new puzzle to solve and pass the time, while helping to educate.  Our family is big into puzzles during this period.

Here is the issue:
While running an FIO test with a single job thread pinned to a specific CPU, I can trap requests arriving at the nvmf layer from a core, and on a queue, that are not aligned with the CPU specified to FIO.
I can replicate this on the baseline nvme-5.5-rc or nvme-5.7-rc1 branches of the infradead repository, with no other patches applied.
For a typical 30 second 4k 100% read test, over 2 million requests are processed, with fewer than 100 submitted by this other CPU to a different queue.  When this occurs it causes a 1-3% drop in performance.
My nvmf queue configuration is 1 nr_io_queue and 104 nr_poll_queues, which equals the number of active cores in the system.  As indicated, this is while running an FIO test using io_uring for 100% random reads.  I have seen this with a queue depth of 1 and batch of 1, as well as queue depth 32 and batch of 8.
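For reference, the mismatch count above comes from a small counting hack in nvme_tcp_queue_request(), roughly the sketch below (against drivers/nvme/host/tcp.c; the helper and counter names are mine, for illustration, not anything upstream):

static atomic64_t nvme_tcp_offcpu_reqs = ATOMIC64_INIT(0);

/* Called from nvme_tcp_queue_request() with the target queue in hand:
 * bump the counter whenever the submitting CPU is not the CPU the
 * queue is bound to (queue->io_cpu).  The total is dumped at teardown
 * and compared against the overall request count for the run. */
static void nvme_tcp_count_offcpu(struct nvme_tcp_queue *queue)
{
	if (queue->io_cpu != raw_smp_processor_id())
		atomic64_inc(&nvme_tcp_offcpu_reqs);
}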

The basic command line is:
/fio --filename=/dev/nvme0n1 --time_based --runtime=30 --ramp_time=10 --thread --rw=randrw --rwmixread=100 --refill_buffers --direct=1 --ioengine=io_uring --hipri --fixedbufs --bs=4k --iodepth=32 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=32 --iodepth_batch=8 --numjobs=1 --group_reporting --gtod_reduce=0 --disable_lat=0 --name=cpu3 --cpus_allowed=3

Adding monitoring inside nvme_tcp_queue_request() and nvme_tcp_poll() I see the following: polling from the expected CPU, but across different queues, each with its own assigned CPU [queue->io_cpu]; and new queue requests arriving on an unexpected CPU [not the one directed in the FIO invocation], on a queue whose context is assigned that same CPU value.  Note: even when requests come in on different CPU cores, all polling is done from the same, expected CPU core.
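The prints themselves are roughly the sketch below, called from both functions with the queue pointer in hand (the helper name and format string are mine, for illustration; not the exact diff I am carrying):

static void nvme_tcp_trace_cpu(struct nvme_tcp_queue *queue, const char *who)
{
	/* queue->io_cpu is the CPU the queue was bound to when it was set
	 * up; raw_smp_processor_id() is whichever CPU is running now */
	pr_info("nvme_tcp: %s: [Queue CPU %d], [CPU %d]\n",
		who, queue->io_cpu, raw_smp_processor_id());
}

With that in place, the trace looks like this: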
[  524.867622] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867686] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]
[  524.867693] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867755] nvme_tcp: nvme_tcp_queue_request: IO-Q [Queue CPU 75], [CPU 75]
[  524.867758] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]
[  524.867777] nvme_tcp: nvme_tcp_queue_request: IO-Q [Queue CPU 3], [CPU 3]
[  524.867781] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867853] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]
[  524.867864] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]

So, if someone can help solve this puzzle and help me understand what is causing this behavior, that would be great.  It is hard for me to believe it is expected, or beneficial, behavior to need some other core/queue for fewer than 100 requests out of more than 2 million.

Good hunting --- Cheers, Mark



