On 1/5/2022 7:26 PM, Keith Busch wrote:
On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
This patch worked for me with 2 namespaces for NVMe PCI.
I'll check it later with my RDMA queue_rqs patches as well. There we also have
tag set sharing with the connect_q (and not only with multiple namespaces),
but the connect_q uses reserved tags only (for the connect commands).
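For reference, the reserved-tags arrangement looks roughly like this (a minimal
sketch with illustrative names and numbers, not the actual fabrics code):

    #include <linux/blk-mq.h>

    #define EXAMPLE_RESERVED_TAGS  2  /* illustrative reserve size */

    /* the I/O tag set keeps a small reserved pool for connect commands */
    static void example_tagset_init(struct blk_mq_tag_set *set)
    {
            set->queue_depth   = 128;  /* illustrative */
            set->reserved_tags = EXAMPLE_RESERVED_TAGS;
    }

    /* requests on the connect_q are allocated from the reserved pool only */
    static struct request *example_alloc_connect_rq(struct request_queue *connect_q)
    {
            return blk_mq_alloc_request(connect_q, REQ_OP_DRV_OUT,
                                        BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
    }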
I saw some strange things that I couldn't understand:
1. running randread fio with libaio ioengine didn't call nvme_queue_rqs -
expected
2. running randwrite fio with libaio ioengine did call nvme_queue_rqs - Not
expected!!
3. running randread fio with io_uring ioengine (and --iodepth_batch=32)
didn't call nvme_queue_rqs - Not expected!!
4. running randwrite fio with io_uring ioengine (and --iodepth_batch=32) did
call nvme_queue_rqs - expected
5. running randread fio with io_uring ioengine (and --iodepth_batch=32
--runtime=30) didn't finish after 30 seconds and was stuck for 300 seconds
(the fio jobs required "kill -9 fio" to drop the refcounts on nvme_core) - Not
expected!!
   debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
seconds, it appears to be stuck. Doing forceful exit of this job.
6. running randwrite fio with io_uring ioengine (and --iodepth_batch=32
--runtime=30) didn't finish after 30 seconds and was stuck for 300 seconds
(the fio jobs required "kill -9 fio" to drop the refcounts on nvme_core) - Not
expected!!
   debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
seconds, it appears to be stuck. Doing forceful exit of this job.
Any idea what could cause these unexpected scenarios? At least they are
unexpected for me :)
Not sure about all the scenarios. I believe it should call queue_rqs
anytime we finish a plugged list of requests as long as the requests
come from the same request_queue, and it's not being flushed from
io_schedule().
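In simplified form, the condition reads roughly like this (a sketch of how I
understand it, not the verbatim block layer code): the batched hook is only
tried when the plugged list belongs to a single queue, no elevator is
involved, and we're not flushing from io_schedule().

    #include <linux/blk-mq.h>

    static void example_flush_plug(struct blk_plug *plug, bool from_schedule)
    {
            struct request *rq = rq_list_peek(&plug->mq_list);
            struct request_queue *q = rq->q;

            if (!plug->multiple_queues && !plug->has_elevator &&
                !from_schedule && q->mq_ops->queue_rqs) {
                    /* hand the whole plugged list to the driver in one call */
                    q->mq_ops->queue_rqs(&plug->mq_list);
                    if (rq_list_empty(plug->mq_list))
                            return;
            }

            /* anything still on the list falls back to one-by-one dispatch */
    }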
I also see that the batch size is > 1 only at the start of the fio run. After
X IO operations the batch size stays at 1 until the end of the run.
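(I'm measuring that with a throwaway counter along these lines - my own debug
hack, assuming the list-based queue_rqs prototype from the series:)

    #include <linux/blk-mq.h>
    #include <linux/printk.h>

    /* throwaway debug helper, called from the driver's ->queue_rqs() */
    static void example_count_batch(struct request **rqlist)
    {
            struct request *req;
            unsigned int batch = 0;

            rq_list_for_each(rqlist, req)
                    batch++;

            pr_info("queue_rqs batch size: %u\n", batch);
    }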
The stuck fio job might be a lost request, which is what this series
should address. It would be unusual to see such an error happen in
normal operation, though. I had to synthesize errors to verify the bug
and fix.
But there are no timeout errors or prints in dmesg. If there were timeout
prints, I would suspect the issue is in the local NVMe device, but there
aren't any.
Also, this phenomenon doesn't happen with the NVMf/RDMA code I developed locally.
In any case, I'll run more multi-namespace tests to see if I can find
any other issues with shared tags.
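For context, the multi-namespace case I mean is simply several request queues
created from the controller's one tag set, roughly like this (an illustrative
sketch, not code from the series):

    #include <linux/blk-mq.h>
    #include <linux/err.h>

    static int example_add_two_namespaces(struct blk_mq_tag_set *ctrl_tagset)
    {
            /* each disk gets its own request_queue but they share one tag pool */
            struct gendisk *ns1 = blk_mq_alloc_disk(ctrl_tagset, NULL);
            struct gendisk *ns2 = blk_mq_alloc_disk(ctrl_tagset, NULL);

            if (IS_ERR(ns1) || IS_ERR(ns2))
                    return -ENOMEM;

            return 0;
    }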
I believe the above concerns are not related to shared tags but to the entire
mechanism.