Re: [PATCH 13/16] fnic: allocate device reset command on the fly

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 11/18/23 00:31, Karan Tilak Kumar (kartilak) wrote:
On Wednesday, November 15, 2023 4:22 PM, Karan Tilak Kumar (kartilak) wrote:

On Tuesday, November 14, 2023 12:02 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
I integrated your patch set using "b4 shazam" and built all the changes to do some dev testing.
I instrumented the code to do the following:

- After one million IOs, drop one IO response.
- This will trigger an abort. Drop that abort response.
- This will trigger a device reset. I'm seeing that the tag here is 0xFFFFFFFF (SCSI_NO_TAG).

This tag fails the out-of-range tag check and escalates to host reset.
I've been able to repro this three times with the same result.

The expectation with this test case is that the device reset should go through successfully without escalating to host reset.

Thanks for the report.
Certainly not what was intended with the patch. Can you try with this patch on top:

I applied this patch on top of the previous one and performed the unit tests that I had mentioned before.


(The first hunk is actually a bugfix, and will be included with the next submission anyway.) Please enable fnic debug during testing and send me the logs.
It _might_ be that this is actually by design, as we will be failing if all tags are busy (ie if the error is EWOULDBLOCK).
But if we get EAGAIN it means that the queue is blocked and we'll need to investigate.

There were two attempts at reproduction. One of them resulted in a crash. I have enabled fnic debug and captured the log.
I've pasted the excerpt here for reference.

Repro 1:

[ .. ]

=============
Repro 2:

[ .. ]


Regards,
Karan


Repro 3:

Nov 17 15:27:00 rhel-c4s5 kernel: fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0x52 sc: 00000000fab117f4 CDB Opcode: 0x28 Dropping icmnd completion
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: Abort Cmd called FCID 0x5205f2, LUN 0x2 TAG 52 flags 3
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: CBD Opcode: 28 Abort issued time: 30067 msec
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0x52 sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from abort cmd type 2 FAILED
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset called FCID 0x5205f2, LUN 0x2 sc: 00000000fab117f4
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset allocation failed with -11
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_reset called
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: update_mac 00:25:b5:cc:aa:00
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Issued fw reset
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: set port_id 0 fp 0000000000000000
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from fnic reset SUCCESS
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_cleanup_io: tag:0x52 : sc:0x00000000fab117f4 duration = 52450 DID_TRANSPORT_DISRUPTED
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Calling done for IO not issued to fw: tag:0x52 sc:0x00000000fab117f4
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: reset cmpl success
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca
Nov 17 15:27:54 rhel-c4s5 kernel: host7: Assigned Port ID 0d0880
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: set port_id d0880 fp 0000000067a66565
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: update_mac 0e:fc:00:0d:08:80
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: FLOGI reg issued fcid d0880 map 0 dest 8c:60:4f:95:ea:a4
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: flog reg succeeded
Nov 17 15:28:06 rhel-c4s5 kernel: sd 7:0:3:2: Power-on or device reset occurred

Seems that 'blk_queue_enter()' called from scsi_alloc_request() is failing, presumably as the queue is frozen/quiesced.
Can you try with the attached patch instead of the previous debug patch?

On, and incidentally: there's an unlock missing:

diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
index 0278c4a207f3..47bcc6bd7376 100644
--- a/drivers/scsi/fnic/fnic_scsi.c
+++ b/drivers/scsi/fnic/fnic_scsi.c
@@ -2233,8 +2233,10 @@ int fnic_device_reset(struct scsi_device *sdev)
        io_lock = fnic_io_lock_hash(fnic, sc);
        spin_lock_irqsave(io_lock, flags);
        io_req = fnic_priv(sc)->io_req;
-       if (io_req)
+       if (io_req) {
+               spin_unlock_irqrestore(io_lock, flags);
                goto fnic_device_reset_end;
+       }

        io_req = mempool_alloc(fnic->io_req_pool, GFP_ATOMIC);
        if (!io_req) {

Maybe fold it in with your patchset (if it's not already merged).

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
From f7eba73f0654ebd2060b92ab08c9458243a9c914 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@xxxxxxx>
Date: Mon, 20 Nov 2023 08:41:16 +0100
Subject: [PATCH] fnic: debug device reset command allocation failures

Add code to debug why device reset command allocation fails.

Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
---
 drivers/scsi/fnic/fnic_scsi.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
index 8909cf09cf9e..0278c4a207f3 100644
--- a/drivers/scsi/fnic/fnic_scsi.c
+++ b/drivers/scsi/fnic/fnic_scsi.c
@@ -2209,8 +2209,8 @@ int fnic_device_reset(struct scsi_device *sdev)
 	}
 
 	req = scsi_alloc_request(sdev->request_queue, REQ_OP_DRV_IN,
-				 BLK_MQ_REQ_NOWAIT);
-	if (!req) {
+				 BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PM);
+	if (IS_ERR(req)) {
 		/*
 		 * Request allocation might fail, indicating that
 		 * all tags are busy.
@@ -2221,7 +2221,10 @@ int fnic_device_reset(struct scsi_device *sdev)
 		 * to host reset anyway.
 		 */
 		FNIC_SCSI_DBG(KERN_ERR, fnic->lport->host,
-			      "Device reset allocation failed, all tags busy\n");
+			      "Device reset allocation failed (error %ld%s%s)\n",
+			      PTR_ERR(req),
+			      sdev->request_queue->mq_freeze_depth ? ",frozen" : "",
+			      sdev->quiesced_by ? ",quiesced" : "");
 		return ret;
 	}
 	sc = blk_mq_rq_to_pdu(req);
-- 
2.35.3


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux