Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery

I still want your patches for tcp/rdma that move the freeze.
If you are not planning to send them, I swear I will :)

Ming, can you please send the tcp/rdma patches that move the
freeze? As I said before, it addresses an existing issue with
requests unnecessarily blocked on a frozen queue instead of
failing over.

Any chance to fix the current issue in one easy (backportable) way [1] first?

There is, you suggested one. And I'm requesting you to send a patch for
it.

The patch is the one pointed to by link [1], and it can still be applied to
the current Linus tree.

https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@xxxxxxxxxx/
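
Roughly, the idea is the following (a sketch only, not the actual diff;
the removal hook and the state check here are illustrative placeholders):

--
/*
 * If deletion happens while error recovery has left the queues
 * frozen, force the DEAD state so that requests blocked on the
 * frozen queues are failed instead of hanging forever.
 */
static void nvme_sketch_remove(struct nvme_ctrl *ctrl)
{
	if (ctrl->state != NVME_CTRL_LIVE)	/* placeholder check */
		nvme_change_ctrl_state(ctrl, NVME_CTRL_DEAD);
	nvme_remove_namespaces(ctrl);
}
--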

This is separate from what I am talking about.

All previous discussions on delaying the freeze [2] are generic and apply to
all nvme drivers, not to mention that this error-handling difference causes
an extra maintenance burden. I still suggest converting all drivers in the
same way, and will work along the approach in [1], aiming for v6.6.

But we obviously hit a difference in expectations between different
drivers. In tcp/rdma there is currently an _existing_ bug, where
we freeze the queue on error recovery and unfreeze it only after we
reconnect. In the meantime, requests can be blocked on the frozen
request queue instead of failing over like they should.

In fabrics the delta between error recovery and reconnect can (and
often will) be minutes or more. Hence I request that we solve _this_
issue, which is addressed by moving the freeze to the reconnect path.
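
To spell out the window, here is the current tcp ordering as I read it
(the function names are the real ones, the timeline comments are mine):

--
/*
 * error recovery, time T0:
 *	nvme_tcp_teardown_io_queues()
 *		nvme_start_freeze(ctrl);	// frozen here ...
 *		nvme_quiesce_io_queues(ctrl);
 *		...
 *	nvme_unquiesce_io_queues(ctrl);	// in-flight requests fail over,
 *					// but new submitters block in
 *					// blk_queue_enter() on the
 *					// frozen queue
 *
 * reconnect, time T1 (possibly minutes later):
 *	nvme_tcp_configure_io_queues()
 *		...
 *		nvme_unfreeze(ctrl);	// ... and unfrozen only here
 */
--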

I personally think that pci should do this as well; at least
dual-ported multipath pci devices would prefer an instant failover
over waiting for a full reset cycle. But Keith disagrees and I am not
going to push for it.

Regardless of anything we do in pci, the tcp/rdma transport
freeze-blocking-failover _must_ be addressed.

It is one generic issue: freeze/unfreeze has to be strictly paired
in every driver.

For any nvme driver, the imbalance can happen when error handling
is involved, which is why I suggest fixing the issue in one generic
way.
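
To make the imbalance concrete, an illustrative sequence (whether a
given driver can actually reach it depends on its error path; the
point is that blk-mq's freeze depth is a counter):

--
	nvme_start_freeze(ctrl);	/* error handling, attempt #1 */
	/* recovery does not complete, error handling runs again */
	nvme_start_freeze(ctrl);	/* attempt #2: freeze depth is 2 */
	/* recovery finally succeeds */
	nvme_unfreeze(ctrl);		/* depth only drops to 1: the
					 * queue stays frozen, IO hangs */
--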

Ming, you are ignoring what I'm saying. I don't care if the
freeze/unfreeze is 100% balanced or not (for the sake of this
discussion).

I'm talking about a _separate_ issue where a queue
is frozen for potentially many minutes, blocking requests that
could otherwise fail over.

So can you please submit a patch for each transport? Please phrase it
as what it is, a bug fix, so stable kernels can pick it up. And try to
keep it isolated to _only_ the freeze change so that it is easily
backportable.

The patch "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
recovery" can fix them all (including nvme tcp/fc's issue), and it can be
backported.

Ming, this is completely separate from what I'm talking about. This one
is addressing when the controller is removed, while I'm talking about
the error-recovery and failover, which is ages before the controller is
removed.

But as we discussed, we still want to call freeze/unfreeze in pairs, and
I also suggest the following approach [2], which isn't good for backporting:

	1) move the freeze into reset
	
	2) during the reset:
	
	- freeze NS queues
	- unquiesce NS queues
	- nvme_wait_freeze()
	- update_nr_hw_queues
	- unfreeze NS queues
	
	3) meanwhile, change the driver's ->queue_rq() for the case where the ctrl state is NVME_CTRL_CONNECTING (see the sketch after this list):
	
	- if the request is FS IO with data, re-submit all bios of this request, and free the request
	
	- otherwise, fail the request


[2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@xxxxxxxxxxx/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
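
A rough sketch of how 2) and 3) could look (the handler name, the
ctrl/rq variables and nvme_resubmit_bios() are placeholders, not a
patch; the nvme_* helpers are the real ones):

--
static void nvme_xxx_reset_work(struct work_struct *work)
{
	/* ... after tearing down and re-establishing the queues ... */
	nvme_start_freeze(ctrl);		/* freeze NS queues */
	nvme_unquiesce_io_queues(ctrl);		/* unquiesce NS queues */
	nvme_wait_freeze(ctrl);
	blk_mq_update_nr_hw_queues(ctrl->tagset, nr_io_queues);
	nvme_unfreeze(ctrl);			/* unfreeze NS queues */
}

/* and in the driver's ->queue_rq(): */
	if (ctrl->state == NVME_CTRL_CONNECTING) {
		if (!blk_rq_is_passthrough(rq) && blk_rq_bytes(rq))
			/* FS IO with data: resubmit bios, free request */
			return nvme_resubmit_bios(rq);
		return nvme_fail_nonready_command(ctrl, rq);
	}
--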

Ming, please read again what my concern is. I'm talking about error
recovery freezing a queue, and unfreezing it only after we reconnect,
blocking requests that should fail over.

The controller removal when the request queue is frozen is a separate
issue, which we should address, but separately.

What I am requesting is that you send two patches, one for tcp and
one for rdma, identical to what you did in:
[PATCH 2/2] nvme: don't freeze/unfreeze queues from different contexts

rdma.c patch:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0eb79696fb73..354cce8853c1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -918,6 +918,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 		goto out_cleanup_tagset;

 	if (!new) {
+		nvme_start_freeze(&ctrl->ctrl);
 		nvme_unquiesce_io_queues(&ctrl->ctrl);
 		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
 			/*
@@ -926,6 +927,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 			 * to be safe.
 			 */
 			ret = -ENODEV;
+			nvme_unfreeze(&ctrl->ctrl);
 			goto out_wait_freeze_timed_out;
 		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
@@ -975,7 +977,6 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_start_freeze(&ctrl->ctrl);
 		nvme_quiesce_io_queues(&ctrl->ctrl);
 		nvme_sync_io_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
--

tcp.c patch:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..5ae08e9cb16d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1909,6 +1909,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;

 	if (!new) {
+		nvme_start_freeze(ctrl);
 		nvme_unquiesce_io_queues(ctrl);
 		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
 			/*
@@ -1917,6 +1918,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 			 * to be safe.
 			 */
 			ret = -ENODEV;
+			nvme_unfreeze(ctrl);
 			goto out_wait_freeze_timed_out;
 		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
@@ -2021,7 +2023,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	if (ctrl->queue_count <= 1)
 		return;
 	nvme_quiesce_admin_queue(ctrl);
-	nvme_start_freeze(ctrl);
 	nvme_quiesce_io_queues(ctrl);
 	nvme_sync_io_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
--

Nothing more.


