On Mon, Jul 10, 2023 at 11:23:44AM +0300, Sagi Grimberg wrote:
> 
> > > > > namespace's request queue is frozen and quiesced during error
> > > > > recovery, writeback IO is blocked in bio_queue_enter(), so
> > > > > fsync_bdev() <- del_gendisk() can't move on, and causes IO hang.
> > > > > Removal could be from sysfs, hard unplug or error handling.
> > > > > 
> > > > > Fix this kind of issue by marking the controller as DEAD if removal
> > > > > breaks error recovery.
> > > > > 
> > > > > This way is reasonable too, because the controller can't be
> > > > > recovered any more after being removed.
> > > > 
> > > > This looks fine to me Ming,
> > > > Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
> > > > 
> > > > I still want your patches for tcp/rdma that move the freeze.
> > > > If you are not planning to send them, I swear I will :)
> > > 
> > > Ming, can you please send the tcp/rdma patches that move the
> > > freeze? As I said before, it addresses an existing issue with
> > > requests unnecessarily blocked on a frozen queue instead of
> > > failing over.
> > 
> > Any chance to fix the current issue in one easy (backportable) way[1]
> > first?
> 
> There is, you suggested one. And I'm requesting you to send a patch for
> it.

The patch is the one pointed to by link [1], and it can still be applied
to the current Linus tree.

https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@xxxxxxxxxx/

> > All previous discussions on delay freeze[2] are generic, and apply to
> > all nvme drivers, not to mention this error handling difference causes
> > extra maintenance burden. I still suggest converting all drivers in the
> > same way, and will work along the approach[1] aiming for v6.6.
> 
> But we obviously hit a difference in expectations from different
> drivers. In tcp/rdma there is currently an _existing_ bug, where
> we freeze the queue on error recovery, and unfreeze only after we
> reconnect. In the meantime, requests can be blocked on the frozen
> request queue and not failover like they should.
> 
> In fabrics the delta between error recovery and reconnect can (and
> often will be) minutes or more. Hence I request that we solve _this_
> issue which is addressed by moving the freeze to the reconnect path.
> 
> I personally think that pci should do this as well, and at least
> dual-ported multipath pci devices would prefer instant failover
> rather than after a full reset cycle. But Keith disagrees and I am
> not going to push for it.
> 
> Regardless of anything we do in pci, the tcp/rdma transport
> freeze-blocking-failover _must_ be addressed.

It is one generic issue: freeze/unfreeze has to be paired strictly for
every driver. For any nvme driver, the imbalance can happen when error
handling is involved, which is why I suggest fixing the issue in one
generic way.

> So can you please submit a patch for each? Please phrase it as what
> it is, a bug fix, so stable kernels can pick it up. And try to keep
> it isolated to _only_ the freeze change so that it is easily
> backportable.

The patch "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
recovery" can fix them all (including nvme tcp/fc's issue), and can be
backported.
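
For context on why marking the controller DEAD unblocks the hang:
nvme_remove_namespaces() already fails all namespace queues for a DEAD
controller, along these lines (current core code, slightly abridged,
comments mine):

	if (ctrl->state == NVME_CTRL_DEAD) {
		/* fail all pending and future IO on the NS queues */
		nvme_mark_namespaces_dead(ctrl);
		/* unquiesce so the failed IO can actually complete */
		nvme_unquiesce_io_queues(ctrl);
	}

Once a removal that breaks error recovery ends up in that path, writeback
IO blocked in bio_queue_enter() completes with an error instead of
waiting on the frozen queue, and fsync_bdev() <- del_gendisk() can make
progress.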
But as we discussed, we still want to call freeze/unfreeze in pair, and I
also suggest the following approach[2], which isn't good to backport:

1) move freeze into reset

2) during resetting:

- freeze NS queues
- unquiesce NS queues
- nvme_wait_freeze()
- update_nr_hw_queues
- unfreeze NS queues

3) meantime change the driver's ->queue_rq() in case the ctrl state is
NVME_CTRL_CONNECTING:

- if the request is FS IO with data, re-submit all bios of this request,
  and free the request
- otherwise, fail the request

Rough sketches of 2) and 3) are below.

[2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@xxxxxxxxxxx/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
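
For 2), using the current core helpers (a sketch only; nr_hw_queues
stands for whatever IO queue count is negotiated after reset/reconnect):

	/* inside reset/reconnect, with the freeze moved here */
	nvme_start_freeze(ctrl);
	nvme_unquiesce_io_queues(ctrl);
	nvme_wait_freeze(ctrl);
	blk_mq_update_nr_hw_queues(ctrl->tagset, nr_hw_queues);
	nvme_unfreeze(ctrl);

For 3), the ->queue_rq() side could look roughly like the sketch below;
the helper name is made up, and it is modelled on what
nvme_failover_req() already does with blk_steal_bios():

	/*
	 * Sketch only: requeue FS IO at bio level while the ctrl is
	 * NVME_CTRL_CONNECTING, instead of keeping the request queue
	 * frozen across the whole recovery window.
	 */
	static blk_status_t nvme_queue_rq_connecting(struct request *rq)
	{
		struct bio_list bios;
		struct bio *bio;

		/* passthrough or data-less requests can only be failed */
		if (blk_rq_is_passthrough(rq) || !rq->bio)
			return BLK_STS_IOERR;

		bio_list_init(&bios);
		/* detach all bios from the request, then free it */
		blk_steal_bios(&bios, rq);
		blk_mq_end_request(rq, BLK_STS_OK);

		/* re-enter bio submission for each stolen bio */
		while ((bio = bio_list_pop(&bios)))
			submit_bio_noacct(bio);

		return BLK_STS_OK;
	}

(A real implementation would likely defer the re-submission to a work
item, like nvme multipath's requeue_work, instead of calling
submit_bio_noacct() directly from ->queue_rq() context.)

thanks,
Ming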