Re: [EXT] Re: [dm-devel] RFC: one more time: SCSI device identification

"Ulrich Windl" <Ulrich.Windl@xxxxxxxxxxxxxxxxxxxx> · Tue, 27 Apr 2021 09:02:10 +0200

>>> Erwin van Londen <erwin@xxxxxxxxxxxxxxxxxx> wrote on 27.04.2021 at 05:48
in message <b5f288fb43bc79e0206794a901aef5b1761813de.camel@xxxxxxxxxxxxxxxxxx>:

> 
> On Mon, 2021-04-26 at 13:16 +0000, Martin Wilck wrote:
>> On Mon, 2021-04-26 at 13:14 +0200, Ulrich Windl wrote:
>> > > While we're at it, I'd like to mention another issue: WWID changes.
>> > > 
>> > > This is a big problem for multipathd. The gist is that the device
>> > > identification attributes in sysfs only change after rescanning the
>> > > device. Thus if a user changes LUN assignments on a storage system,
>> > > it can happen that a direct INQUIRY returns a different WWID than
>> > > sysfs does, which is fatal. If we plan to rely more on sysfs for
>> > > device identification in the future, the problem gets worse.
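To make the mismatch concrete, here is a minimal Python sketch, not taken from multipathd or the kernel, that splits a raw Device Identification VPD page (0x83) into its designation descriptors and compares the logical-unit designators of a cached copy against a fresh one. It assumes the cached copy would come from something like the kernel's vpd_pg83 sysfs attribute and the fresh copy from a direct INQUIRY (EVPD=1, page code 0x83); how either is obtained is left out. Port-related descriptors are ignored because they legitimately differ between paths. All function names, including the make_naa_page helper, are hypothetical.

```python
import struct

ASSOC_LOGICAL_UNIT = 0x0  # association field value: addressed logical unit
DESIG_NAA = 0x3           # designator type value: NAA


def parse_vpd83(page: bytes):
    """Split a raw page 0x83 into (association, designator_type, designator) tuples."""
    if len(page) < 4 or page[1] != 0x83:
        raise ValueError("not a Device Identification VPD page")
    (page_len,) = struct.unpack(">H", page[2:4])  # big-endian page length
    end = min(4 + page_len, len(page))
    descriptors = []
    off = 4  # descriptors start after the 4-byte page header
    while off + 4 <= end:
        assoc = (page[off + 1] >> 4) & 0x3
        dtype = page[off + 1] & 0x0F
        dlen = page[off + 3]
        if off + 4 + dlen > end:
            break  # truncated descriptor: ignore the remainder
        descriptors.append((assoc, dtype, bytes(page[off + 4:off + 4 + dlen])))
        off += 4 + dlen
    return descriptors


def lu_identity(page: bytes):
    """Designators describing the logical unit itself (not target ports)."""
    return {(t, d) for a, t, d in parse_vpd83(page) if a == ASSOC_LOGICAL_UNIT}


def identity_changed(cached: bytes, fresh: bytes) -> bool:
    """True if the LU designators differ between two copies of page 0x83."""
    return lu_identity(cached) != lu_identity(fresh)


def make_naa_page(naa_hex: str) -> bytes:
    """Hypothetical test helper: minimal page 0x83 with one binary NAA descriptor."""
    naa = bytes.fromhex(naa_hex)
    desc = bytes([0x01, (ASSOC_LOGICAL_UNIT << 4) | DESIG_NAA, 0x00, len(naa)]) + naa
    return bytes([0x00, 0x83]) + struct.pack(">H", len(desc)) + desc
```

For example, identity_changed(make_naa_page("60060e80123b050050403b0500000709"), make_naa_page("60060e80123b050050403b0500000710")) is True, which is exactly the situation where the sysfs view no longer matches the device.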
>> > 
>> > I think many devices rely on the fact that they are identified by
>> > vendor/model/serial number, because in most professional SAN storage
>> > systems you can pre-set the serial number to a custom value; so if
>> > you want a new disk (maybe a snapshot) to be compatible with the old
>> > one, just assign the same serial number. I guess that's the idea
>> > behind it.
>> 
>> What you are saying sounds dangerous to me. If a snapshot has the same
>> WWID as the device it's a snapshot of, it must not be exposed to any
>> host(s) at the same time as its origin; otherwise the host may
>> happily combine it with the origin into one multipath map, and data
>> corruption will almost certainly result.
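The danger follows from the fact that a multipath layer groups paths into maps by WWID. The deliberately simplified Python model below (hypothetical names; real multipathd also applies blacklists, path checkers and more) shows that nothing in the grouping step can tell a snapshot path apart from an origin path once they report the same WWID.

```python
from collections import defaultdict


def group_by_wwid(paths):
    """paths: iterable of (path_name, wwid). Returns {wwid: [path_name, ...]}."""
    maps = defaultdict(list)
    for name, wwid in paths:
        # Only the WWID is used as the key, so identical WWIDs always merge.
        maps[wwid].append(name)
    return dict(maps)
```

With the origin on sdb and sdc and a snapshot exposed as sdd under the same WWID "W1", group_by_wwid returns {"W1": ["sdb", "sdc", "sdd"]}: the snapshot silently becomes a third path to the origin.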
>> 
>> My argument is about how the host is supposed to deal with a WWID
>> change if it happens. Here, "WWID change" means that a given H:C:T:L
>> suddenly exposes different device designators than it used to, while
>> the device is in use by a host. Here, too, data corruption is
>> imminent, and can happen in the blink of an eye. To avoid this,
>> several things are needed:
>> 
>>  1) the host needs to get notified about the change (likely by a UA
>>     (Unit Attention) of some sort),
>>  2) the kernel needs to react to the notification immediately, e.g.
>>     by blocking IO to the device,
>>  3) userspace tooling such as udev or multipathd needs to figure out
>>     how to deal with the situation cleanly, and eventually unblock
>>     it.
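The three steps above can be sketched as a small per-path state machine. This is a hypothetical Python model, not kernel or multipathd code; in particular, the policy of retiring a path whose WWID changed (instead of unblocking it into its old map) is an assumption made for illustration.

```python
from enum import Enum, auto


class PathState(Enum):
    ACTIVE = auto()
    BLOCKED = auto()   # IO held back until the identity is re-verified
    REMOVED = auto()   # identity changed: path must leave its old map


class Path:
    """Hypothetical model of one H:C:T:L path."""

    def __init__(self, hctl, wwid):
        self.hctl = hctl
        self.wwid = wwid              # identity the path was set up with
        self.state = PathState.ACTIVE

    def on_unit_attention(self):
        # Steps 1 and 2: a notification arrives, stop issuing IO at once.
        self.state = PathState.BLOCKED

    def on_reverified(self, fresh_wwid):
        # Step 3: userspace has re-read the identity and decides.
        if self.state is not PathState.BLOCKED:
            raise RuntimeError("re-verification without a preceding block")
        if fresh_wwid == self.wwid:
            self.state = PathState.ACTIVE    # same LU: safe to unblock
        else:
            self.state = PathState.REMOVED   # different LU: never unblock here
        return self.state
```

The key design point is that the block happens before any userspace round trip, so no IO can reach a changed LU during the window in which the new identity is being evaluated.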
>> 
>> Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
>> afaics.
>> 
> In my view the WWID should never change. If a snapshot is created, it
> should obtain a new WWID. An example from a Hitachi array is:
> 
> Device Identification VPD page:
> Addressed logical unit:
> designator type: T10 vendor identification, code set: ASCII
> vendor id: HITACHI 
> vendor specific: 50403B050709
> designator type: NAA, code set: Binary
> 0x60060e80123b050050403b0500000709
> 
> The majority of the NAA WWID is tied to the storage subsystem and
> identifies the vendor OUI, model, serial etc. The last 4 hex digits in
> this example (0709) indicate the LDEV ID (sorry, mainframe heritage
> here...). When a snapshot is taken, these 4 digits change, as a new
> LDEV ID is assigned to the snapshot. This sort of behaviour should be
> consistent across all storage vendors, imho.
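The NAA designator above can be split into its fields. The Python sketch below uses the NAA type 6 (IEEE Registered Extended) layout: a 4-bit NAA value, a 24-bit IEEE company ID (OUI), a 36-bit vendor-specific identifier and a 64-bit vendor-specific extension. Taking the last four hex digits as the LDEV ID follows the description for this Hitachi array only and should not be assumed to hold for other vendors; the function names are hypothetical.

```python
def decode_naa6(hex_str):
    """Split a 128-bit NAA type 6 designator (hex string) into its fields."""
    s = hex_str.lower().removeprefix("0x")
    if len(s) != 32 or s[0] != "6":
        raise ValueError("not an NAA type 6 designator")
    return {
        "naa": 6,
        "ieee_company_id": s[1:7],         # 24-bit OUI
        "vendor_specific_id": s[7:16],     # 36 bits
        "vendor_specific_ext": s[16:32],   # 64 bits
    }


def ldev_id(hex_str):
    """Last 4 hex digits; the LDEV ID on the Hitachi array described above."""
    return decode_naa6(hex_str)["vendor_specific_ext"][-4:]
```

For 0x60060e80123b050050403b0500000709 this yields the company ID 0060e8, the vendor-specific identifier 0123b0500, the extension 50403b0500000709 and an LDEV ID of 0709. A snapshot on the same array would keep the leading fields and differ in the LDEV part, so its WWID would differ from the origin's.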

It's getting off-topic, but in automatic disaster recovery scenarios one might want the "new disk" (maybe a snapshot of the original disk taken before it got corrupted) to look like the "old disk", so that the OS can boot without needing any adjustments.

Regards,
Ulrich

> 
>> Martin
>>