Re: [PATCH V2] nvme: mark ctrl as DEAD if removing from error recovery

Christoph Hellwig <hch@xxxxxx> · Thu, 29 Jun 2023 09:33:05 +0200

On Thu, Jun 29, 2023 at 02:48:18PM +0800, Ming Lei wrote:
> @@ -4054,8 +4055,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	 * disconnected. In that case, we won't be able to flush any data while
>  	 * removing the namespaces' disks; fail all the queues now to avoid
>  	 * potentially having to clean up the failed sync later.
> +	 *
> +	 * If this removal happens during error recovering, resetting part
> +	 * may not be started, or controller isn't be recovered completely,
> +	 * so we have to treat controller as DEAD for avoiding IO hang since
> +	 * queues can be left as frozen and quiesced.
>  	 */
> -	if (ctrl->state == NVME_CTRL_DEAD) {
> +	if (ctrl->state == NVME_CTRL_DEAD ||
> +	    ctrl->old_state != NVME_CTRL_LIVE) {
>  		nvme_mark_namespaces_dead(ctrl);
>  		nvme_unquiesce_io_queues(ctrl);

Thanks for the comment and style, but I really still think doing
the state check was wrong to start with, and adding a check on the
old state makes things significantly worse.  Can we try to brainstorm
on how to do this properly?

I think we need to first figure out how to balance the quiesce/unquiesce
calls; the placement of the nvme_mark_namespaces_dead call should
be the simple part.
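
To make the balancing question concrete, here is a toy user-space model
(not driver code; every name in it is hypothetical).  It assumes
quiescing is tracked as a depth counter, so an unquiesce with no
matching quiesce can be detected instead of silently going through:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the NVMe driver. */
struct toy_queue {
	int quiesce_depth;	/* number of outstanding quiesce calls */
};

/* Each quiesce call adds one level that must be undone later. */
static void toy_quiesce(struct toy_queue *q)
{
	q->quiesce_depth++;
}

/*
 * Undo one quiesce.  Returns false and leaves the depth unchanged
 * if there is no outstanding quiesce, i.e. the call is unbalanced.
 */
static bool toy_unquiesce(struct toy_queue *q)
{
	if (q->quiesce_depth <= 0)
		return false;
	q->quiesce_depth--;
	return true;
}

/* Requests may be dispatched only when no quiesce is outstanding. */
static bool toy_can_dispatch(const struct toy_queue *q)
{
	return q->quiesce_depth == 0;
}
```

In this model the removal path may only unquiesce if it knows a quiesce
is outstanding; whether a state check can establish that is the open
question.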