Re: The PQ=1 saga

Brian Bunker <brian@xxxxxxxxxxxxxxx> · Wed, 25 Jan 2023 11:56:55 -0800

> On Jan 24, 2023, at 8:04 PM, Martin K. Petersen <martin.petersen@xxxxxxxxxx> wrote:
> 
> 
> Brian,
> 
>> For a completely separate reason I would like to see PQ=1 expose the
>> sd device.
> 
> The host RAID controller case we could probably cover without relying on
> PQ=1 at all (we kind-of already do). But there are also storage arrays
> out there that rely on PQ=1 to inhibit devices being claimed.
> Historically they did this because some other operating systems couldn't
> handle a processor device type. So I suspect that keying off of TPGS
> alone is probably not sufficient to determine whether PQ=1 should cause
> us to attach a ULD or not in your scenario.
My idea was to change the check in scsi_sysfs.c as follows:
-       return (sdp->inq_periph_qual == SCSI_INQ_PQ_CON)? 1: 0;
+       return (sdp->inq_periph_qual != SCSI_INQ_PQ_NOT_CAP)? 1: 0;

This would let a ULD attach for PQ=1 but not for PQ=3, which I think is
the right thing to do.
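To make the effect of that one-line change concrete, here is a minimal standalone sketch of the two predicates. It is not the kernel code itself: the function names uld_match_current and uld_match_proposed are hypothetical, and the qualifier constants are redefined locally with the values I understand include/scsi/scsi.h to use.

```c
#include <stdint.h>

/* Peripheral qualifier values (assumed to mirror include/scsi/scsi.h). */
#define SCSI_INQ_PQ_CON     0x00 /* device is connected to this LUN */
#define SCSI_INQ_PQ_NOT_CON 0x01 /* capable, but not currently connected */
#define SCSI_INQ_PQ_NOT_CAP 0x03 /* target cannot support a device here */

/* Current check: only PQ=0 allows an upper-level driver to bind. */
int uld_match_current(uint8_t inq_periph_qual)
{
	return inq_periph_qual == SCSI_INQ_PQ_CON;
}

/* Proposed check: anything other than PQ=3 allows binding. */
int uld_match_proposed(uint8_t inq_periph_qual)
{
	return inq_periph_qual != SCSI_INQ_PQ_NOT_CAP;
}
```

One thing the sketch makes visible: the `!=` form admits every qualifier value other than 3, including the reserved ones, not only 0 and 1.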
> 
>> ALUA state transitions from unavailable back to another state does not
>> work depending on what state devices are in when they are initially
>> discovered.  In the ALUA unavailable state the peripheral qualifier of
>> the device should also be set to 001b.
> 
> Yep, an unfortunate wrinkle in the spec (although it makes sense).
> 
>> This hole makes the unavailable ALUA state unattractive. Allowing the
>> peripheral qualifier set to 001b to still create an sd device on
>> discovery corrects this hole.
> 
> Does your implementation actually support READ CAPACITY etc. in
> unavailable state? Otherwise we'd end up with zero-length, read-only
> block devices with no logical block size. And we've been down that path
> before and that is no fun.
Yes, we can support READ CAPACITY in the unavailable state. For us,
the unavailable state means that one controller or array cannot
reach the other controller or array on the back end, but the front-end
ports are still connected. They are up from an initiator transport
perspective.
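For context on the zero-length concern, the sketch below shows how the 8-byte READ CAPACITY(10) parameter data maps to a block count and logical block size. It is an illustration only, assuming the standard layout (big-endian last LBA followed by big-endian block length); the struct and function names are hypothetical and not taken from the kernel.

```c
#include <stdint.h>

struct capacity {
	uint64_t num_blocks; /* last LBA + 1 */
	uint32_t block_size; /* logical block length in bytes */
};

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode READ CAPACITY(10) data: bytes 0-3 last LBA, bytes 4-7 block length. */
struct capacity parse_read_capacity10(const uint8_t data[8])
{
	struct capacity c;

	c.num_blocks = (uint64_t)be32(data) + 1;
	c.block_size = be32(data + 4);
	return c;
}
```

If a target cannot answer this command while unavailable, the initiator has no size or block length to build a usable block device from.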
> 
> I suspect it would be better to trigger a re-probe of the device when
> transitioning out of unavailable state. Most of the logic is already in
> place and we reread VPD pages, etc. I believe there are only a few
> pieces missing from being able to do a full in-place update.
Unfortunately this doesn’t work. It does work in other OSes, where
I can log out the connection and, when it comes back, the host
discovers that the LUN no longer has the PQ set and brings it online
fine. But in Linux this results in (after the PLOGI and PRLI):

Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: INQUIRY pass 1 length 36
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: INQUIRY successful with code 0x0
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: INQUIRY pass 2 length 96
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: INQUIRY successful with code 0x0
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: peripheral device type of 31, no device added
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: Sending REPORT LUNS to (try 0)
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: REPORT LUNS successful (try 0) result 0x0
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:0: scsi scan: REPORT LUN scan
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:1: scsi scan: device exists on 7:0:0:1
Jan 25 11:42:28 init72-5 kernel: scsi 7:0:0:2: scsi scan: device exists on 7:0:0:2

So unless those devices are removed before the rescan, which I
cannot control from the target, no sd device is created by the
rescan after the logout. Only the sg nodes remain:

/dev/sg5  7 0 0 1  0  PURE      FlashArray        8888
/dev/sg6  7 0 0 2  0  PURE      FlashArray        8888
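For readers matching the log against the spec: both the peripheral qualifier discussed in this thread and the "peripheral device type of 31" in the scan output come from byte 0 of the standard INQUIRY data. A minimal sketch of the split follows; the helper names inquiry_pq and inquiry_pdt are hypothetical.

```c
#include <stdint.h>

/* Peripheral qualifier: top three bits of INQUIRY byte 0. */
uint8_t inquiry_pq(const uint8_t *inq)
{
	return (inq[0] >> 5) & 0x07;
}

/* Peripheral device type: low five bits of INQUIRY byte 0. */
uint8_t inquiry_pdt(const uint8_t *inq)
{
	return inq[0] & 0x1f;
}
```

For example, a byte 0 of 0x20 decodes to PQ=1 (001b) with device type 0 (direct access), while 0x3f decodes to PQ=1 with device type 31 (0x1f, unknown or no device type).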

Thanks,
Brian

> 
> -- 
> Martin K. Petersen Oracle Linux Engineering