On Tue, Jan 04, 2022 at 02:15:58PM +0200, Max Gurtovoy wrote:
> This patch worked for me with 2 namespaces for NVMe PCI.
>
> I'll check it later on with my RDMA queue_rqs patches as well. There
> we also have tagset sharing with the connect_q (and not only with
> multiple namespaces).
>
> But the connect_q is using reserved tags only (for the connect
> commands).
>
> I saw some strange things that I couldn't understand:
>
> 1. running randread fio with libaio ioengine didn't call
> nvme_queue_rqs - expected
>
> 2. running randwrite fio with libaio ioengine did call
> nvme_queue_rqs - Not expected !!
>
> 3. running randread fio with io_uring ioengine (and
> --iodepth_batch=32) didn't call nvme_queue_rqs - Not expected !!
>
> 4. running randwrite fio with io_uring ioengine (and
> --iodepth_batch=32) did call nvme_queue_rqs - expected
>
> 5. running randread fio with io_uring ioengine (and --iodepth_batch=32
> --runtime=30) didn't finish after 30 seconds and got stuck for 300
> seconds (the fio jobs required "kill -9 fio" to release the refcounts
> on nvme_core) - Not expected !!
>
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
>
> 6. running randwrite fio with io_uring ioengine (and
> --iodepth_batch=32 --runtime=30) didn't finish after 30 seconds and
> got stuck for 300 seconds (the fio jobs required "kill -9 fio" to
> release the refcounts on nvme_core) - Not expected !!
>
> debug print: fio: job 'task_nvme0n1' (state=5) hasn't exited in 300
> seconds, it appears to be stuck. Doing forceful exit of this job.
>
> Any idea what could cause these unexpected scenarios? At least
> unexpected for me :)

Not sure about all the scenarios. I believe queue_rqs should be called
any time we finish a plugged list of requests, as long as the requests
all come from the same request_queue and the plug isn't being flushed
from io_schedule(). A rough sketch of the condition I have in mind is
at the bottom of this mail.

The stuck fio jobs might be due to a lost request, which is what this
series should address. It would be unusual to see such an error happen
in normal operation, though; I had to synthesize errors to verify the
bug and the fix.

In any case, I'll run more multi-namespace tests to see if I can find
any other issues with shared tags.
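For illustration, here is a minimal C sketch of the condition described
above. This is simplified and not the actual block layer code; the
struct and function names (plug_sketch, can_use_queue_rqs) are made up
for this example:

#include <stdbool.h>

/* Stand-in for the per-task plug state the block layer tracks. */
struct plug_sketch {
        unsigned short rq_count;        /* number of plugged requests */
        bool multiple_queues;           /* requests span request_queues */
        bool has_elevator;              /* an I/O scheduler is attached */
};

/* from_schedule is true when the flush comes from io_schedule() */
static bool can_use_queue_rqs(const struct plug_sketch *plug,
                              bool from_schedule)
{
        /*
         * The whole batch can go to ->queue_rqs only when every plugged
         * request targets the same request_queue and this is a normal
         * flush, not one triggered by io_schedule().  (I believe an
         * attached I/O scheduler also disqualifies the batch path.)
         */
        return plug->rq_count > 0 &&
               !plug->multiple_queues &&
               !plug->has_elevator &&
               !from_schedule;
}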