Re: Sighting: io_uring requests on unexpected core

Hey all, Mark here again with another sighting.

Hey Mark,

If you're all WFH like me during this virus period, maybe this will provide you with a new puzzle to solve and pass the time, while helping to educate. Our family is big into puzzles during this period.

:-)

Here is the issue:
While performing an FIO test with a single job thread pinned to a specific CPU, I can trap requests to the nvmf layer from a core and queue not aligned with the FIO-specified CPU.
I can replicate this on the baseline of the nvme-5.5-rc or nvme-5.7-rc1 branches of the infradead repository, with no other patches applied.
For a typical 30 second 4k 100% read test there will be over 2 million packets processed, with under 100 sent by this other CPU to a different queue. When this occurs, it causes a drop in performance of 1-3%.
My nvmf queue configuration is 1 nr_io_queue and 104 nr_poll_queues, which equals the number of active cores in the system.

Given that you pin your fio thread, the high poll queue count shouldn't
really matter, I assume.

As indicated, this is while running an FIO test using io_uring for 100% random read. I have seen this with a queue depth of 1 and a batch of 1, as well as a queue depth of 32 and a batch of 8.

  The basic command line being:
/fio --filename=/dev/nvme0n1 --time_based --runtime=30 --ramp_time=10 --thread --rw=randrw --rwmixread=100 --refill_buffers --direct=1 --ioengine=io_uring --hipri --fixedbufs --bs=4k --iodepth=32 --iodepth_batch_complete_min=1 --iodepth_batch_complete_max=32 --iodepth_batch=8 --numjobs=1 --group_reporting --gtod_reduce=0 --disable_lat=0 --name=cpu3 --cpus_allowed=3

Adding monitoring within the functions nvme_tcp_queue_request() and nvme_tcp_poll(), I see the following: polling from the expected CPU for different queues with different assigned CPUs [queue->io_cpu], and a new queue request coming in on an unexpected CPU [not the one directed on the FIO invocation] with a queue context assigned that same CPU value. Note: even when requests come in on different CPU cores, all polling is from the same expected CPU core.
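For reference, here is a minimal sketch of the kind of instrumentation
described above (the helper name and exact hook points are assumptions,
not the actual debug patch). It prints the CPU the queue was mapped to
at setup time (queue->io_cpu) next to the CPU the caller is currently
running on, which is what produces the [Queue CPU x], [CPU y] pairs in
the log below:

    /*
     * Hypothetical debug helper for drivers/nvme/host/tcp.c; the real
     * instrumentation may differ.  "what" names the call site, e.g.
     * "nvme_tcp_poll" or "nvme_tcp_queue_request: IO-Q".
     */
    static void nvme_tcp_trace_cpu(struct nvme_tcp_queue *queue,
                                   const char *what)
    {
            pr_debug("nvme_tcp: %s: [Queue CPU %d], [CPU %d]\n",
                     what, queue->io_cpu, raw_smp_processor_id());
    }

Called, for example, at the top of nvme_tcp_poll() and in
nvme_tcp_queue_request().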

nvme_tcp_poll: [Queue CPU 3], [CPU 3] means that poll is called
on cpu core [3] for a queue that is mapped to cpu core [3], correct?

nvme_tcp_poll: [Queue CPU 75], [CPU 3] means that poll is called
on cpu core [3] for a queue that is mapped to cpu core [75], correct?

[  524.867622] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867686] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]

I'm assuming that this is a poll invocation of a prior submission to
a queue that is mapped to CPU 75?
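To spell out why that can happen, here is a rough sketch of the polling
path as I understand it (simplified, not verbatim kernel code): the
cookie returned at submission time selects which hardware queue to poll,
so the poll runs on whatever CPU the pinned io_uring thread is on, while
the queue is whichever one took the original submission:

    /* Simplified sketch of blk_poll() queue selection (an assumption). */
    struct blk_mq_hw_ctx *hctx =
            q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];

    hctx->mq_ops->poll(hctx);   /* ends up in nvme_tcp_poll() on the current CPU */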

[  524.867693] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867755] nvme_tcp: nvme_tcp_queue_request: IO-Q [Queue CPU 75], [CPU 75]

This log print means that on cpu core [3] we see a request submitted on
a queue that is mapped to cpu core [75], correct?

[  524.867758] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]
[  524.867777] nvme_tcp: nvme_tcp_queue_request: IO-Q [Queue CPU 3], [CPU 3]
[  524.867781] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]
[  524.867853] nvme_tcp:        nvme_tcp_poll: [Queue CPU 75], [CPU 3]
[  524.867864] nvme_tcp:        nvme_tcp_poll: [Queue CPU 3], [CPU 3]

So, if someone can help solve this puzzle and help me understand what is causing this behavior, that would be great. It is hard for me to see this as expected or beneficial behavior, needing some other core/queue for fewer than 100 requests out of over 2 million.

I'm assuming that this phenomenon also happens without polling?

Anyways, it is unexpected to me. Given that you have a queue that is
mapped to the cpu you are pinning on, I'd expect that all requests
generated on this cpu would be submitted on that same queue.
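For what it's worth, here is a minimal sketch of a check that could help
narrow this down (placement and names are assumptions, not actual driver
code): flag the rare submissions that arrive on a CPU other than the
queue's io_cpu and record which task issued them, so you can tell whether
it is really your pinned fio thread or some other context:

    /*
     * Hypothetical check, e.g. next to the queuing path in
     * drivers/nvme/host/tcp.c; not the actual driver code.
     */
    if (unlikely(queue->io_cpu != raw_smp_processor_id()))
            pr_info_ratelimited("nvme_tcp: queue mapped to CPU %d written from CPU %d by %s/%d\n",
                                queue->io_cpu, raw_smp_processor_id(),
                                current->comm, current->pid);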

Does anyone have any insights on this?


