Re: [PATCH] block: don't call blk_mq_run_hw_queues() for dead or dying queues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/26/19 6:49 PM, James Smart wrote:

Cancelling/flushing scan_work before starting reset won't work. Hannes is, correct, reset must take precedent.

The scenario is a controller that was recently connected, thus scan work started, and while the scan thread is running, there's an error or connectivity is lost.  The errors force the reset - which must run to terminate any outstanding requests, including those issued by the scan thread.  This has to happen or we're going to hang the scan thread as it's synchronous io.  The nvme_fc_delete_association->nvme_start_queues() sequence says the reset path froze the queues, cancelled all outstanding io and is unfreezing the queues to allow io to pass again, so that any new io on the request queue can be rejected while the controller is not connected (ioctls or subsequent io from the scan thread while the controller is unconnected as well as anything that was pending on the io queues from the fs that should be bounced back for multipath).  Meanwhile, the scan thread got the error on the io request, and since it's the identify command that failed, it's deleting the namespace at the same time the transport is unfreezing the queues.

Given the transport nvme_start_queues() routine has no idea what ns-queues exist, or what state the ns queues are in, it seems like a synchronization issue on the ns queue itself, that can't be solved by scheduling transport work elements.

I wonder if it should be something like this:
--- a    2019-03-26 10:46:08.576056741 -0700
+++ b    2019-03-26 10:47:55.081252967 -0700
@@ -3870,8 +3870,10 @@ void nvme_start_queues(struct nvme_ctrl
      struct nvme_ns *ns;

      down_read(&ctrl->namespaces_rwsem);
-    list_for_each_entry(ns, &ctrl->namespaces, list)
-        blk_mq_unquiesce_queue(ns->queue);
+    list_for_each_entry(ns, &ctrl->namespaces, list) {
+        if (!test_bit(NVME_NS_REMOVING, &ns->flags))
+            blk_mq_unquiesce_queue(ns->queue);
+    }
      up_read(&ctrl->namespaces_rwsem);
  }
  EXPORT_SYMBOL_GPL(nvme_start_queues);

which if valid, would also mean the same check should be in nvme_stop_queues() and perhaps elsewhere.

Well, if the namespace wouldn't be part of the list in the first place once it's been set to 'REMOVING'... lemme check.

Cheers,

Hannes



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux