Re: [PATCH 03/24] zfcp: open-code fc_block_scsi_eh() for host reset

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hannes, I'm working on a solution, just need more time.

On 5/4/22 13:40, Steffen Maier wrote:
On 5/4/22 12:46, Steffen Maier wrote:
On 5/3/22 22:06, Hannes Reinecke wrote:
When issuing a host reset we should be waiting for all
ports to become unblocked; just waiting for one might
be resulting in host reset to return too early.

Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Cc: Steffen Maier <maier@xxxxxxxxxxxxx>
Cc: Benjamin Block <bblock@xxxxxxxxxxxxx>
---
  drivers/s390/scsi/zfcp_scsi.c | 21 +++++++++++++++------
  1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/s390/scsi/zfcp_scsi.c b/drivers/s390/scsi/zfcp_scsi.c
index 526ac240d9fe..2e615e1dcde6 100644
--- a/drivers/s390/scsi/zfcp_scsi.c
+++ b/drivers/s390/scsi/zfcp_scsi.c
@@ -373,9 +373,11 @@ static int zfcp_scsi_eh_target_reset_handler(struct scsi_cmnd *scpnt)
  static int zfcp_scsi_eh_host_reset_handler(struct scsi_cmnd *scpnt)
  {
-    struct zfcp_scsi_dev *zfcp_sdev = sdev_to_zfcp(scpnt->device);
-    struct zfcp_adapter *adapter = zfcp_sdev->port->adapter;
-    int ret = SUCCESS, fc_ret;
+    struct Scsi_Host *host = scpnt->device->host;
+    struct zfcp_adapter *adapter = (struct zfcp_adapter *)host->hostdata[0];
+    int ret = SUCCESS;
+    unsigned long flags;
+    struct zfcp_port *port;
      if (!(adapter->connection_features & FSF_FEATURE_NPIV_MODE)) {
          zfcp_erp_port_forced_reopen_all(adapter, 0, "schrh_p");
@@ -383,9 +385,16 @@ static int zfcp_scsi_eh_host_reset_handler(struct scsi_cmnd *scpnt)
      }
      zfcp_erp_adapter_reopen(adapter, 0, "schrh_1");
      zfcp_erp_wait(adapter);

Now internal zfcp recovery for the adapter completed, but there are async pending follow-up things, such as rport(s) unblock, see below.

-    fc_ret = fc_block_scsi_eh(scpnt);
-    if (fc_ret)
-        ret = fc_ret;
+
+    spin_lock_irqsave(&adapter->port_list_lock, flags);

zfcp_adapter.port_list_lock is of type rwlock_t:

+    read_lock_irqsave(&adapter->port_list_lock, flags);

+    list_for_each_entry(port, &adapter->port_list, list) {
+        struct fc_rport *rport = port->rport;

While an rport is blocked, port->rport == NULL [zfcp_scsi_rport_block() and
zfcp_scsi_rport_register()], so this could be a NULL pointer deref.

Typically all rports would still be blocked after above adapter recovery, until they become unblocked (via zfcp's async rport_work) [zfcp_erp_try_rport_unblock() near the end of lun and port recovery that happen as follow-up parts of adapter recovery and before zfcp_erp_wait() returns; it eventually calls zfcp_scsi_schedule_rport_register() queueing work item port->rport_work on which we could do flush_work() [one prior art in zfcp_init_device_configure()]].

Am starting to wonder if we would really need to sync with the async rports unblocks (for this adapter) after zfcp_erp_wait() above and before any deref port->rport. Either within this loop or with a separate loop before this one. ( Another option would be to somehow iterate rports differently via common code lists so we directly get rport references. )

+
+        ret = fc_block_rport(rport);

I think I just noticed that that callee includes an msleep but we hold a read lock on port_list_lock. That does not seem right [scheduling while atomic?!]. Maybe we would need an open coded version after all to be able to drop both locks in correct order (and re-acquire) in the open-coded copy of fc_block_rport()?


Lock order looks good, we hold port_list_lock here and fc_block_rport() takes Scsi_Host host_lock, so it matches prior art:
static void zfcp_erp_action_dismiss_adapter(struct zfcp_adapter *adapter)
         read_lock(&adapter->port_list_lock);
             zfcp_erp_action_dismiss_port(port);
             spin_lock(port->adapter->scsi_host->host_lock);

+        if (ret)
+            break;

So once we find the first rport with
(rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
we would break the loop and return from zfcp_scsi_eh_host_reset_handler() with FAST_IO_FAIL. Is that correct? What if there are still other blocked rports if we were to continue with the loop? Or am I missing something regarding the goal of "waiting for all rports to become unblocked"?

Once we completed the loop, the question arises which consolidated return code would be the correct one, if we got different ret for different rports in the loop. Does my previous brain dump make sense?:
I was pondering in my own patch version what to return of just a subset
of ports ran into fast_io_fail. And for now I thought just fast_io_fail
would be good to let I/O bubble up for path failover, even if this would
include of rport which meanwhile unblocked properly and would not need
bubbling up pending requests because they could service them with a
simple retry.

+    }
+    spin_unlock_irqrestore(&adapter->port_list_lock, flags);

+    read_unlock_irqrestore(&adapter->port_list_lock, flags);

      zfcp_dbf_scsi_eh("schrh_r", adapter, ~0, ret);
      return ret;






--
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: David Faller
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux