On Sep 13, 2022 / 08:57, Daniel Wagner wrote:
> The target is allowed to change the number of I/O queues. Test if the
> host is able to reconnect in this scenario.
>
> Signed-off-by: Daniel Wagner <dwagner@xxxxxxx>
> ---
> I've sent a fix for the bug reported in nvmet_subsys_attr_qid_max_show()

Thanks!

>
> Though I have found another problem with this test:
>
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:.
> nvme nvme0: creating 8 I/O queues.
> nvme nvme0: mapped 8/0/0 default/read/poll queues.
> nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 10.161.61.228:4420
> nvme nvme0: starting error recovery
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.0.0-rc2+ #26 Not tainted
> ------------------------------------------------------
> kworker/6:6/17471 is trying to acquire lock:
> ffff888103d76440 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
>
> but task is already holding lock:
> ffffc90002497de0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x4ec/0xa90
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
>        process_one_work+0x565/0xa90
>        worker_thread+0x2c0/0x710
>        kthread+0x16c/0x1a0
>        ret_from_fork+0x1f/0x30
>
> -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
>        __flush_workqueue+0x105/0x850
>        nvmet_rdma_cm_handler+0xf9f/0x174e [nvmet_rdma]
>        cma_cm_event_handler+0x77/0x2d0 [rdma_cm]
>        cma_ib_req_handler+0xbd4/0x23c0 [rdma_cm]
>        cm_process_work+0x2f/0x210 [ib_cm]
>        cm_req_handler+0xf73/0x2020 [ib_cm]
>        cm_work_handler+0x6d6/0x37c0 [ib_cm]
>        process_one_work+0x5b6/0xa90
>        worker_thread+0x2c0/0x710
>        kthread+0x16c/0x1a0
>        ret_from_fork+0x1f/0x30
>
> -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
>        __mutex_lock+0x12d/0xe40
>        cma_ib_req_handler+0x956/0x23c0 [rdma_cm]
>        cm_process_work+0x2f/0x210 [ib_cm]
>        cm_req_handler+0xf73/0x2020 [ib_cm]
>        cm_work_handler+0x6d6/0x37c0 [ib_cm]
>        process_one_work+0x5b6/0xa90
>        worker_thread+0x2c0/0x710
>        kthread+0x16c/0x1a0
>        ret_from_fork+0x1f/0x30
>
> -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
>        __lock_acquire+0x1a9d/0x2d90
>        lock_acquire+0x15d/0x3f0
>        __mutex_lock+0x12d/0xe40
>        rdma_destroy_id+0x17/0x20 [rdma_cm]
>        nvmet_rdma_free_queue+0x54/0x180 [nvmet_rdma]
>        nvmet_rdma_release_queue_work+0x2c/0x70 [nvmet_rdma]
>        process_one_work+0x5b6/0xa90
>        worker_thread+0x2c0/0x710
>        kthread+0x16c/0x1a0
>        ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
> Chain exists of:
>   &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock((work_completion)(&queue->release_work));
>                                lock((wq_completion)nvmet-wq);
>                                lock((work_completion)(&queue->release_work));
>   lock(&id_priv->handler_mutex);

I also observe this WARNING on my test machine with nvme_trtype=rdma.
It looks like a hidden rdma issue to me...

>
> And there is the question of how to handle the different kernel log output:
>
> c740:~/blktests # nvme_trtype=tcp ./check nvme/046
> nvme/046 (Test queue count changes on reconnect)             [passed]
>     runtime  32.750s  ...  32.340s
> c740:~/blktests # nvme_trtype=rdma ./check nvme/046
> nvme/046 (Test queue count changes on reconnect)             [failed]
>     runtime  32.340s  ...  24.716s
> something found in dmesg:
> [307171.053929] run blktests nvme/046 at 2022-09-13 08:32:07
> [307171.884448] rdma_rxe: loaded
> [307171.965507] eth0 speed is unknown, defaulting to 1000
> [307171.967049] eth0 speed is unknown, defaulting to 1000
> [307171.968671] eth0 speed is unknown, defaulting to 1000
> [307172.057714] infiniband eth0_rxe: set active
> [307172.058544] eth0 speed is unknown, defaulting to 1000
> [307172.058630] infiniband eth0_rxe: added eth0
> [307172.107627] eth0 speed is unknown, defaulting to 1000
> [307172.230771] nvmet: adding nsid 1 to subsystem blktests-feature-detect
> ...
> (See '/root/blktests/results/nodev/nvme/046.dmesg' for the entire message)
> c740:~/blktests # nvme_trtype=fc ./check nvme/046
> nvme/046 (Test queue count changes on reconnect)             [failed]
>     runtime  24.716s  ...  87.454s
>     --- tests/nvme/046.out      2022-09-09 16:23:22.926123227 +0200
>     +++ /root/blktests/results/nodev/nvme/046.out.bad   2022-09-13 08:35:03.716097654 +0200
>     @@ -1,3 +1,89 @@
>      Running nvme/046
>     -NQN:blktests-subsystem-1 disconnected 1 controller(s)
>     +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
>     +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
>     +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
>     +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
>     +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
>     ...
>     (Run 'diff -u tests/nvme/046.out /root/blktests/results/nodev/nvme/046.out.bad' to see the entire diff)
>
> The grep error is something I can fix in the test, but I don't know how
> to handle the 'eth0 speed is unknown' kernel message, which will make
> the test always fail. Is it possible to ignore them when parsing the
> kernel log for errors?

FYI, each blktests test case can define DMESG_FILTER so that specific
keywords in dmesg do not cause a failure. Test cases meta/011 and
block/028 are reference use cases (a rough sketch is also appended
below the signature).

Having said that, I don't think 'eth0 speed is unknown' in dmesg makes
the test case fail. See _check_dmesg() in the "check" script, which
lists the keywords that blktests treats as errors. I guess the WARNING
above is the actual cause of the failure.

--
Shin'ichiro Kawasaki
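
For reference, the grep noise in the fc run above comes from the controller's
sysfs state file disappearing while the controller is torn down and re-created
during reconnect. A minimal sketch of a polling helper that tolerates this is
shown below; the helper name, arguments, and timeout are hypothetical and are
not taken from the actual nvme/046 patch:

#!/bin/bash
# Hypothetical helper, not from the real nvme/046 test: poll a fabrics
# controller state without printing grep errors when the sysfs node
# disappears during reconnect (as seen with nvme_trtype=fc).
nvmf_wait_for_state() {
	local ctrl="$1" state="$2" timeout="${3:-10}"
	local state_file="/sys/class/nvme-fabrics/ctl/${ctrl}/state"
	local deadline=$((SECONDS + timeout))

	while (( SECONDS < deadline )); do
		# grep -s suppresses "No such file or directory" while the
		# controller device is being removed and re-created.
		if grep -qs "${state}" "${state_file}"; then
			return 0
		fi
		sleep 1
	done

	echo "timeout waiting for ${ctrl} to reach state '${state}'" >&2
	return 1
}

# Example use inside a test body:
# nvmf_wait_for_state nvme0 live 20 || echo FAIL

The -s flag (or redirecting stderr) is what silences the 'No such file or
directory' messages; the loop simply retries until the state file reappears
with the expected value or the timeout expires.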
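
And a rough sketch of the DMESG_FILTER mechanism mentioned above. The exact
semantics should be taken from the reference test cases meta/011 and block/028
rather than from this sketch; the assumption here is that DMESG_FILTER holds a
command the captured dmesg output is piped through before the check script
scans it for error keywords, and the filter pattern is purely illustrative
(as noted above, 'eth0 speed is unknown' is not a keyword _check_dmesg() fails
on anyway):

#!/bin/bash
# SPDX-License-Identifier: GPL-3.0+
#
# Illustrative skeleton only -- check meta/011 and block/028 for the real
# DMESG_FILTER conventions before relying on this.

. tests/nvme/rc

DESCRIPTION="Test queue count changes on reconnect"

# Assumption: DMESG_FILTER is a command that dmesg output is filtered
# through before being scanned for error keywords, so a benign message
# could be dropped like this:
DMESG_FILTER="grep -v 'eth0 speed is unknown'"

test() {
	echo "Running ${TEST_NAME}"
	# ... test body elided ...
	echo "Test complete"
}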