On 9/15/22 14:50, Daniel Wagner wrote:
So do we agree the fc host should not stop reconnecting? James?
With the change below the test succeeds once:
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 42767fb75455..b167cf9fee6d 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3325,8 +3325,6 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: reset: Reconnect attempt failed (%d)\n",
ctrl->cnum, status);
- if (status > 0 && (status & NVME_SC_DNR))
- recon = false;
} else if (time_after_eq(jiffies, rport->dev_loss_end))
recon = false;
This part was introduced with f25f8ef70ce2 ("nvme-fc: short-circuit
reconnect retries"):
nvme-fc: short-circuit reconnect retries
Returning an nvme status from nvme_fc_create_association() indicates
that the association is established, and we should honour the DNR bit.
If it's set a reconnect attempt will just return the same error, so
we can short-circuit the reconnect attempts and fail the connection
directly.
It looks like the reasoning here didn't take into consideration the
scenario we have here. I'd say we should not do it and handle it similar
as with have with tcp/rdma.
But that is wrong.
When we are evaluating the NVMe status we _need_ to check the DNR bit;
that's precisely what it's for.
The real reason here is a queue inversion with fcloop; we've had them
for ages, and I'm not surprised that it pops off now.
In fact, I was quite surprised that I didn't hit it when updating
blktests; in previous versions I had to insert an explicit 'sleep 2'
before disconnect to make it work.
I'd rather fix that than reverting FC to the (wrong) behaviour.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman