Re: The PQ=1 saga

> On Jan 30, 2023, at 5:35 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> 
> On 1/27/23 20:57, Brian Bunker wrote:
>> I was doing some more testing of this since it has been a while since I
>> ran these tests. It looks like reverting this will make the particular situation
>> that I am worried about even worse. I will put the details in.
>> With this in place (before you revert it), when SCSI devices are discovered
>> and some have PQ=1 because they are in an unavailable ALUA state:
>> Jan 27 12:05:29 localhost kernel: scsi 7:0:0:1: scsi scan: peripheral device type of 31, no device added
>> I don’t know if this is intentional with the patch or not, but any devices with PQ=1
>> will not create SCSI devices. The logging is deceptive too, since the device type
>> is 0 and not 31. In my case I have two paths to LUN 1. One is ALUA AO and the
>> other is in ALUA unavailable.
>> With this patch in, I only get an sd device and an sg device for the AO path.
>> The other path to LUN 1 gets no devices created because it is caught in the
>> if condition logged above.
>> Because there are no SCSI devices created, when the ALUA state returns
>> to an active state, a SCSI rescan, which I can trigger from the target, will result
>> in the devices getting created since the initial scan never created devices.
>> Jan 27 12:26:04 localhost kernel: scsi 7:0:0:1: scsi scan: INQUIRY pass 1 length 36
>> Jan 27 12:26:04 localhost kernel: scsi 7:0:0:1: scsi scan: INQUIRY successful with code 0x0
>> Jan 27 12:26:04 localhost kernel: scsi 7:0:0:1: scsi scan: INQUIRY pass 2 length 96
>> Jan 27 12:26:04 localhost kernel: scsi 7:0:0:1: scsi scan: INQUIRY successful with code 0x0
>> Jan 27 12:26:04 localhost kernel: scsi 7:0:0:1: Direct-Access     PURE     FlashArray       8888 PQ: 0 ANSI: 6
>> Things are good with both paths to LUN 1 showing up. It is not optimal since the
>> target has to trigger a LUN scan on the initiator affecting all paths to those target
>> ports.
>> With this reverted, things are a little different, but back to the way they had been in
>> the past.
>> Jan 27 13:41:19 localhost kernel: sd 7:0:1:1: Asymmetric access state changed
>> Jan 27 13:41:56 localhost kernel: scsi 7:0:1:1: alua: Detached
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: scsi scan: INQUIRY pass 1 length 36
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: scsi scan: INQUIRY successful with code 0x0
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: scsi scan: INQUIRY pass 2 length 96
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: scsi scan: INQUIRY successful with code 0x0
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: Direct-Access     PURE     FlashArray       8888 PQ: 1 ANSI: 6
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: alua: supports implicit TPGS
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: alua: device naa.624a9370acc31b042de141460001141c port group 0 rel port a
>> Jan 27 13:42:22 localhost kernel: scsi 7:0:1:1: Attached scsi generic sg7 type 0
>> Now an sg device is created but not an sd device. This means that there will be
>> no way for this device to get an sd device created once the ALUA state goes into
>> an active state.
>> The same thing done on the target that worked above no longer does:
>> Jan 27 13:47:48 localhost kernel: scsi 7:0:1:1: scsi scan: device exists on 7:0:1:1
>> To get around this, the existing device must be deleted so it is not caught in the rescan
>> check. This cannot be controlled from the target; it requires manual intervention
>> on the initiator.
>> So the question becomes how the initial scan should work when a LUN has PQ=1 set.
>> PQ=1 is valid by spec with the ALUA unavailable state, but it doesn’t seem to be
>> handled. Why allow an sg device but not an sd one on initial scan in this case? There
>> are probably many ways to fix this. I think the simplest is to allow sd device creation
>> on LUNs where PQ=1, and only restrict PQ=3. I am not sure of the side effects of this on other
>> targets. The other approach, which will no longer work after the revert, is to trigger a
>> rescan from the target. This is sub-optimal since it is disruptive. Any approach involving
>> the ALUA device handler will not help since there is no device to transition if it is
>> discovered with PQ=1.
> Sheesh.
> 
> There _is_ an easy solution for this, and that is to not use PQ=1 in conjunction with ALUA unavailable :-)
> 
> Hiding PQ=1 devices did serve a purpose for Linux, as we still cannot do a 'real' rescan of a SCSI device; the 'vendor' and 'model' strings are pretty much fixed for the lifetime of the device, along with the entire standard inquiry data. So if anything changes here we have to delete the device before we can properly read it.
> 
> (which also means that I'll have to retract my earlier comment about this being a good idea ...)
> 
> And in the absence of that, hiding PQ=1 devices is the best we can do.
> The alternative would be to implement a 'real' device rescan; but that was too daunting a challenge to be undertaken until now.
> Things did change in the meantime, so maybe it's time to revisit that.
> 
> But really, we should ask vendors to _not_ use PQ=1 when using ALUA. I fail to see the benefit of this as both have roughly the same meaning; if you have ALUA unavailable you can't access the device, hence it's completely irrelevant what PQ says. And same for the other way round: if PQ=1 is set really the only ALUA state which makes sense is 'unavailable'.
> 
> Sadly it's not so easy to fix things up in the SCSI stack, as the PQ setting is evaluated during scanning, and the ALUA state only much later.
> 
> Cheers,
> 
> Hannes
What about something like this? This will remove the device if PQ=1 and re-discover it. If the TPG
remains unavailable, it will just be created in the same way. If the TPG has moved to an active state,
the newly created device will be an available sd device. This way at least target vendors can cause
the initiator to rescan and move the devices from unavailable to an active state without manual
intervention on each host to remove the devices with PQ=1 set and rescan.
If manual removal of devices is required, it makes ALUA unavailable unrealistic to use.
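
To make the intended rescan behavior concrete, here is a rough userspace model of the
decision. It is only a sketch, not kernel code: struct model_dev, enum rescan_action and
both function names are made up for illustration. Only the PQ values and the "alua"
handler name come from the kernel, and the "keep existing" case is a simplification of
what the rescan path does today.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Peripheral qualifier values from the standard INQUIRY data. */
enum pq { PQ_CONNECTED = 0, PQ_NOT_CONNECTED = 1, PQ_NOT_CAPABLE = 3 };

/* Hypothetical stand-in for the scsi_device fields the check looks at. */
struct model_dev {
	enum pq pq;          /* qualifier recorded when the device was scanned */
	const char *handler; /* device handler name, or NULL if none attached */
};

/* Hypothetical summary of what a rescan does for one LUN. */
enum rescan_action {
	PROBE_NEW,          /* no device yet: probe the LUN */
	KEEP_EXISTING,      /* device exists: the rescan leaves it alone */
	REMOVE_AND_REPROBE  /* PQ=1 under ALUA: drop it and probe again */
};

/* Same test as the proposed helper: PQ=1 and an ALUA handler attached. */
static bool should_remove(const struct model_dev *dev)
{
	if (dev == NULL || dev->handler == NULL)
		return false;
	return dev->pq == PQ_NOT_CONNECTED &&
	       strncmp(dev->handler, "alua", 4) == 0;
}

static enum rescan_action rescan_decision(const struct model_dev *dev)
{
	if (dev == NULL)
		return PROBE_NEW;
	return should_remove(dev) ? REMOVE_AND_REPROBE : KEEP_EXISTING;
}
```

In this model only the PQ=1 path with an ALUA handler gets torn down and probed again.
An active path (PQ=0), or a PQ=1 device without the ALUA handler, is left as it is.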

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index f9b18fdc7b3c..9ff9ca1b963e 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1123,6 +1123,36 @@ static unsigned char *scsi_inq_str(unsigned char *buf, unsigned char *inq,
 }
 #endif
 
+/**
+ * scsi_remove_offline_device - remove the device if the criteria are met
+ * @sdev:    scsi_device to check
+ *
+ * Description:
+ * A SCSI device which is part of a TPG in the unavailable state will
+ * have the PQ=1. If the device is discovered this way, there is no
+ * way for it to transition to an active state. The device must be
+ * removed and rediscovered during rescan in the event that the TPG
+ * has transitioned to an active state.
+ *
+ * Return:
+ * true: the conditions are met for device removal
+ * false: the conditions are not met
+ **/
+static bool scsi_remove_offline_device(struct scsi_device *sdev)
+{
+       if (sdev == NULL || sdev->handler == NULL)
+               return false;
+
+       if (sdev->inq_periph_qual == SCSI_INQ_PQ_NOT_CON &&
+           (strncmp(sdev->handler->name, "alua", 4) == 0)) {
+               SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
+                                 "scsi scan: discovered not accessible %s\n",
+                                 dev_name(&sdev->sdev_gendev)));
+               return true;
+       }
+       return false;
+}
+
 /**
  * scsi_probe_and_add_lun - probe a LUN, if a LUN is found add it
  * @starget:   pointer to target device structure
@@ -1161,6 +1191,11 @@ static int scsi_probe_and_add_lun(struct scsi_target *starget,
         * host adapter calls into here with rescan == 0.
         */
        sdev = scsi_device_lookup_by_target(starget, lun);
+       if (scsi_remove_offline_device(sdev)) {
+               __scsi_remove_device(sdev);
+               scsi_device_put(sdev);  /* drop the reference taken by the lookup */
+               sdev = NULL;
+       }
        if (sdev) {
                if (rescan != SCSI_SCAN_INITIAL || !scsi_device_created(sdev)) {
                        SCSI_LOG_SCAN_BUS(3, sdev_printk(KERN_INFO, sdev,
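
For anyone following along, the inq_periph_qual value tested in the patch comes from
byte 0 of the standard INQUIRY data: the top three bits are the peripheral qualifier and
the low five bits are the peripheral device type. A small userspace sketch of that
split follows. The helper and enum names are my own for illustration, not kernel API.

```c
#include <stdint.h>

/* Peripheral qualifier values (bits 7-5 of standard INQUIRY byte 0). */
enum {
	INQ_PQ_CON     = 0x0, /* device connected */
	INQ_PQ_NOT_CON = 0x1, /* device supported but not connected */
	INQ_PQ_NOT_CAP = 0x3  /* target not capable of supporting a device here */
};

/* Extract the peripheral qualifier from INQUIRY byte 0. */
static unsigned int inq_periph_qual(uint8_t byte0)
{
	return (byte0 >> 5) & 0x7;
}

/* Extract the peripheral device type (bits 4-0) from INQUIRY byte 0. */
static unsigned int inq_periph_type(uint8_t byte0)
{
	return byte0 & 0x1f;
}
```

A PQ=1 direct-access device, as in the logs above, corresponds to byte 0 being 0x20:
qualifier 1 with device type 0, not device type 31.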

Thanks,
Brian





