Re: [PATCH 1/3] nvme-core: improve avoiding false remove namespace

Sagi Grimberg <sagi@xxxxxxxxxxx> · Wed, 19 Aug 2020 21:33:22 -0700

nvme_revalidate_disk() translates its return value to 0 when the error
is not fatal, which avoids falsely removing the namespace. For negative
errors, only -ENOMEM is currently translated to 0, but other errors
except -ENODEV, such as -EAGAIN or -EBUSY, also need to be translated
to 0.

Another reason for improving the error translation: if a request times
out during connect, __nvme_submit_sync_cmd() returns
NVME_SC_HOST_ABORTED_CMD (> 0). At that point the connect process should
be terminated, but it falsely continues, which may cause a deadlock.
Many callers of __nvme_submit_sync_cmd() treat a positive error code as
"not supported by the target" and continue, but NVME_SC_HOST_ABORTED_CMD
and NVME_SC_HOST_PATH_ERROR both mean the I/O was cancelled by the host.
To fix this bug we need to set the NVME_REQ_CANCELLED flag, so that
__nvme_submit_sync_cmd() translates the return value to -EINTR. That
conflicts with the error translation in nvme_revalidate_disk() and may
cause a false namespace removal.

Signed-off-by: Chao Leng <lengchao@xxxxxxxxxx>
---
  drivers/nvme/host/core.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 88cff309d8e4..43ac8a1ad65d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2130,10 +2130,10 @@ static int _nvme_revalidate_disk(struct gendisk *disk)
  	 * Only fail the function if we got a fatal error back from the
  	 * device, otherwise ignore the error and just move on.
  	 */
-	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
-		ret = 0;
-	else if (ret > 0)
+	if (ret > 0 && (ret & NVME_SC_DNR))
  		ret = blk_status_to_errno(nvme_error_status(ret));
+	else if (ret != -ENODEV)
+		ret = 0;
  	return ret;

We really need to take a step back here. I really don't like how
we are growing implicit assumptions on how statuses are interpreted.

Why don't we remove the -ENODEV error propagation and instead
take care of it in the specific call-sites where we want to ignore
errors with proper quirks?
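
For anyone following along, here is a minimal userspace sketch of the two translation policies in the hunk above, so the difference can be checked outside the kernel. It is not driver code: SKETCH_SC_DNR and SKETCH_SC_HOST_ABORTED_CMD are local names whose values are assumed to mirror NVME_SC_DNR and NVME_SC_HOST_ABORTED_CMD, and sketch_status_to_errno() is a hypothetical stand-in for blk_status_to_errno(nvme_error_status(ret)) that always returns -EIO.

```c
#include <assert.h>
#include <errno.h>

/* Assumed to mirror NVME_SC_DNR ("do not retry" status bit). */
#define SKETCH_SC_DNR              0x4000
/* Assumed to mirror NVME_SC_HOST_ABORTED_CMD. */
#define SKETCH_SC_HOST_ABORTED_CMD 0x371

/*
 * Hypothetical stand-in for blk_status_to_errno(nvme_error_status(ret)).
 * The real helpers map each NVMe status to a specific errno; this sketch
 * collapses every fatal status to -EIO.
 */
static int sketch_status_to_errno(int status)
{
	(void)status;
	return -EIO;
}

/* Translation before the patch: only -ENOMEM and non-DNR statuses become 0. */
static int translate_old(int ret)
{
	if (ret == -ENOMEM || (ret > 0 && !(ret & SKETCH_SC_DNR)))
		ret = 0;
	else if (ret > 0)
		ret = sketch_status_to_errno(ret);
	return ret;
}

/* Translation after the patch: everything except DNR statuses and -ENODEV becomes 0. */
static int translate_new(int ret)
{
	if (ret > 0 && (ret & SKETCH_SC_DNR))
		ret = sketch_status_to_errno(ret);
	else if (ret != -ENODEV)
		ret = 0;
	return ret;
}

/* Spot-check the cases the commit message talks about. */
void sketch_selfcheck(void)
{
	/* -EINTR (cancelled request) is fatal before the patch, ignored after. */
	assert(translate_old(-EINTR) == -EINTR);
	assert(translate_new(-EINTR) == 0);

	/* -ENODEV is propagated by both versions. */
	assert(translate_old(-ENODEV) == -ENODEV);
	assert(translate_new(-ENODEV) == -ENODEV);

	/* A status with the DNR bit set is fatal in both versions. */
	assert(translate_new(SKETCH_SC_HOST_ABORTED_CMD | SKETCH_SC_DNR) == -EIO);
}
```

In this sketch the two versions differ only for negative errnos other than -ENOMEM and -ENODEV (for example -EAGAIN, -EBUSY, -EINTR): the old logic passes them through, the new logic turns them into 0. That is exactly the kind of implicit "everything but -ENODEV is ignorable" assumption being questioned above.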