On 3/26/19 6:49 PM, James Smart wrote:
Cancelling/flushing scan_work before starting reset won't work. Hannes
is, correct, reset must take precedent.
The scenario is a controller that was recently connected, thus scan work
started, and while the scan thread is running, there's an error or
connectivity is lost. The errors force the reset - which must run to
terminate any outstanding requests, including those issued by the scan
thread. This has to happen or we're going to hang the scan thread as
it's synchronous io. The
nvme_fc_delete_association->nvme_start_queues() sequence says the reset
path froze the queues, cancelled all outstanding io and is unfreezing
the queues to allow io to pass again, so that any new io on the request
queue can be rejected while the controller is not connected (ioctls or
subsequent io from the scan thread while the controller is unconnected
as well as anything that was pending on the io queues from the fs that
should be bounced back for multipath). Meanwhile, the scan thread got
the error on the io request, and since it's the identify command that
failed, it's deleting the namespace at the same time the transport is
unfreezing the queues.
Given the transport nvme_start_queues() routine has no idea what
ns-queues exist, or what state the ns queues are in, it seems like a
synchronization issue on the ns queue itself, that can't be solved by
scheduling transport work elements.
I wonder if it should be something like this:
--- a 2019-03-26 10:46:08.576056741 -0700
+++ b 2019-03-26 10:47:55.081252967 -0700
@@ -3870,8 +3870,10 @@ void nvme_start_queues(struct nvme_ctrl
struct nvme_ns *ns;
down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
- blk_mq_unquiesce_queue(ns->queue);
+ list_for_each_entry(ns, &ctrl->namespaces, list) {
+ if (!test_bit(NVME_NS_REMOVING, &ns->flags))
+ blk_mq_unquiesce_queue(ns->queue);
+ }
up_read(&ctrl->namespaces_rwsem);
}
EXPORT_SYMBOL_GPL(nvme_start_queues);
which if valid, would also mean the same check should be in
nvme_stop_queues() and perhaps elsewhere.
Well, if the namespace wouldn't be part of the list in the first place
once it's been set to 'REMOVING'... lemme check.
Cheers,
Hannes