We actually missed a kref_get in nvme_get_ns_from_disk().
This should fix it. Could you help verify?
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4babdf0..b146f52 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -183,6 +183,8 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
 	}
 	spin_unlock(&dev_list_lock);
 
+	kref_get(&ns->ctrl->kref);
+
 	return ns;
 fail_put_ns:
Hey Ming. This avoids the crash in nvme_rdma_free_qe(), but now I see another crash:
[ 975.633436] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.1.14:4420
[ 978.463636] nvme nvme0: creating 32 I/O queues.
[ 979.187826] nvme nvme0: new ctrl: NQN "testnqn", addr 10.0.1.14:4420
[ 987.778287] nvme nvme0: Got rdma device removal event, deleting ctrl
[ 987.882202] BUG: unable to handle kernel paging request at ffff880e770e01f8
[ 987.890024] IP: [<ffffffffa03a1a46>] __ib_process_cq+0x46/0xc0 [ib_core]
This looks like another instance of freeing the tag set before stopping the QP. I thought we had fixed that once and for all, but perhaps there is some other path we missed. :(
The fix doesn't look right to me. But I wonder how you hit this crash
now? If anything, this would only delay the controller removal...