Re: [PATCH] scsi: Do not break scan luns loop if add single lun failed

Wenchao Hao <haowenchao@xxxxxxxxxx> · Tue, 4 Jan 2022 20:10:19 +0800

On 2021/12/31 1:55, Steffen Maier wrote:
On 12/26/21 00:29, Wenchao Hao wrote:
Failed to add a single lun does not mean all luns are unaccessible,
if we break the scan luns loop, the other luns reported by REPORT LUNS
command would not be probed any more.

In this case, we might loss some luns which are accessible.

Could you please add more details about the specific use case, where 
this actually was a problem, for my understanding?


When REPORT LUNS returns 4 luns which are lun0, lun1, lun2 and lun3.
If lun1 becomes inaccessible during the scan flow, 
scsi_probe_and_add_lun() for lun1 would failed, lun2 and lun3 are still 
accessible. scsi_report_lun_scan() would print error log and return 0, 
and scsi_sequential_lun_scan() would not be called.

In this scenario, lun2 and lun3 would not been probed and added any 
more, so we loss them.


Signed-off-by: Wenchao Hao <haowenchao@xxxxxxxxxx>
---
  drivers/scsi/scsi_scan.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 23e1c0acdeae..fee7ce082103 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1476,13 +1476,13 @@ static int scsi_report_lun_scan(struct 
scsi_target *starget, blist_flags_t bflag
                  lun, NULL, NULL, rescan, NULL);
              if (res == SCSI_SCAN_NO_RESPONSE) {
                  /*
-                 * Got some results, but now none, abort.
+                 * Got some results, but now none, abort this lun

abort => skip ?

Yes, "skip" looks better than "abort".


                   */
                  sdev_printk(KERN_ERR, sdev,
                      "Unexpected response"
                      " from lun %llu while scanning, scan"
                      " aborted\n", (unsigned long long)lun);

That message would no longer be correct with your change, as it would 
not abort the scan any more.

I would change "abort" to "skip" which makes it better.


-                break;
+                continue;
              }
          }
      }


Wouldn't this change existing semantics for LLDDs intentionally 
returning -ENXIO from their slave_alloc() callback in certain cases?:



Yes, it would print error message like "Unexpected response ..." for 
every failed lun. I think it's reasonable, so we can know every failed 
lun in one scan flow.

static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
...
    if (shost->hostt->slave_alloc) {
        ret = shost->hostt->slave_alloc(sdev);
        if (ret) {
            /*
             * if LLDD reports slave not present, don't clutter
             * console with alloc failure messages
             */
            if (ret == -ENXIO)
                display_failure_msg = 0;
            goto out_device_destroy;
...
out_device_destroy:
    __scsi_remove_device(sdev);
out:
    if (display_failure_msg)
        printk(ALLOC_FAILURE_MSG, __func__);
    return NULL;


scsi_probe_and_add_lun() [such as called by scsi_report_lun_scan() for 
the case at hand] converts this case into a SCSI_SCAN_NO_RESPONSE return 
value.

static int scsi_probe_and_add_lun(struct scsi_target *starget,
...
    int res = SCSI_SCAN_NO_RESPONSE, result_len = 256;
...
        sdev = scsi_alloc_sdev(starget, lun, hostdata);
    if (!sdev)
        goto out;
...
 out:
    return res;


Such as being used by zfcp:

static int zfcp_scsi_slave_alloc(struct scsi_device *sdev)
{
...
    unit = zfcp_unit_find(port, zfcp_scsi_dev_lun(sdev));
    if (unit)
        put_device(&unit->dev);

    if (!unit && !(allow_lun_scan && npiv)) {
        put_device(&port->dev);
        return -ENXIO;
                       ^^^^^^

which implements an initiator-based LUN masking that is necessary for 
shared HBAs virtualized without NPIV.
https://www.ibm.com/docs/en/linux-on-systems?topic=devices-manually-configured-fcp-luns 


While things might still work, as zfcp now "just" gets (much) more 
callbacks to slave_alloc() it has to end with -ENXIO, the user may get 
flooded with the error(!) sdev_printk on "Unexpected response from LUN 
..." in scsi_report_lun_scan().
In the worst case, we could get this message now 64k - 1 times in a zfcp 
scenario connected to IBM DS8000 storage being able to map (all) 64k 
volumes to a single initiator (HBA), where the user via zfcp sysfs 
decided to use only the first lun reported (for the vHBA).


64k - 1 times error log seems terrible. While I do not understand what 
"where the user via zfcp sysfs decided to use only the first lun 
reported (for the vHBA)" means.

Why would all luns slave_alloc() failed? This don't seem like a normal 
scenario.

Other LLLDs also seem to intentionally return -ENXIO from slave_alloc() 
callbacks, such as but not limited to lpfc or qla2xxx:

int fc_slave_alloc(struct scsi_device *sdev)
{
    struct fc_rport *rport = starget_to_rport(scsi_target(sdev));

    if (!rport || fc_remote_port_chkready(rport))
        return -ENXIO;

static int
qla2xxx_slave_alloc(struct scsi_device *sdev)
{
    struct fc_rport *rport = starget_to_rport(scsi_target(sdev));

    if (!rport || fc_remote_port_chkready(rport))
        return -ENXIO;