Re: [PATCH 2/2] multipathd: add recheck_wwid_time option to verify the path wwid

Benjamin Block <bblock@xxxxxxxxxxxxx> · Thu, 11 Feb 2021 12:25:01 +0100

On Wed, Feb 10, 2021 at 07:09:31PM +0100, Benjamin Block wrote:
> On Tue, Feb 09, 2021 at 10:19:45PM +0000, Martin Wilck wrote:
> > On Mon, 2021-02-08 at 23:19 -0600, Benjamin Marzinski wrote:
> > > > There are cases where the wwid of a path changes due to LUN remapping
> > > > without triggering a uevent for the changed path. Multipathd has no method
> > > > for trying to catch these cases, and corruption has resulted because of it.
> > > 
> > > > In order to have a better chance at catching these cases, multipath now
> > > > has a recheck_wwid_time option, which can either be set to "off" or a
> > > > number of seconds. If a path has been failed for at least the configured
> > > > number of seconds, multipathd will recheck its wwid before restoring it,
> > > > when the path checker sees that it has come back up.
> > 
> > Can't the WWID change also happen without the path going offline, or
> > at least without being offline long enough that multipathd would
> > notice?
> > 
> > > > If multipathd notices that a path's wwid has changed, it will remove and
> > > > re-add the path, just like the existing wwid checking code for change
> > > > events does. In cases where no uevent occurs, both the udev database
> > > > entry and sysfs will have the old wwid, so the only way to get a current
> > > > wwid is to ask the device directly.
> > 
> > > sysfs is wrong too, really? In that case I fear triggering a uevent
> > > won't fix the situation. You need to force the kernel to rescan the
> > > device, otherwise udev will fetch the WWID from sysfs again, which
> > > still has the wrong ID... or what am I missing here?
> > 
> > > > Currently multipath only has code to directly get the wwid for scsi
> > > > devices, so this option only affects scsi devices. Also, since it's
> > > > possible that the udev wwid won't match the wwid from get_vpd_sgio(), if
> > > > multipathd doesn't initially see the two values matching for a device, it
> > > > will disable this option for that device.
> > > 
> > > > If recheck_wwid_time is not turned off, multipathd will also
> > > > automatically recheck the wwid whenever an existing path gets an add
> > > > event, or is manually re-added with cli_add_path().
> > > 
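
For readers following the thread, the restore-time behaviour described in the
patch description above can be sketched roughly as follows. This is only an
illustration of the decision logic as worded in the description, not the
multipathd implementation (which is C). The names Path, init_path(),
should_recheck_on_restore() and restore_action() are hypothetical, and the wwid
"read from the device" is simply passed in as a string rather than fetched with
get_vpd_sgio().

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for recheck_wwid_time = "off".
OFF = None


@dataclass
class Path:
    udev_wwid: str                      # wwid as known from udev/sysfs
    recheck_wwid_time: Optional[int]    # seconds, or OFF
    failed_since: Optional[float] = None
    recheck_enabled: bool = False


def init_path(path: Path, device_wwid: str) -> None:
    """Enable rechecking only if the option is on and the udev wwid
    initially matches the wwid read directly from the device."""
    path.recheck_enabled = (
        path.recheck_wwid_time is not OFF and path.udev_wwid == device_wwid
    )


def should_recheck_on_restore(path: Path, now: float) -> bool:
    """True if the path was failed for at least recheck_wwid_time seconds."""
    if not path.recheck_enabled or path.failed_since is None:
        return False
    return now - path.failed_since >= path.recheck_wwid_time


def restore_action(path: Path, now: float, device_wwid: str) -> str:
    """Decide what to do when the checker sees the path come back up."""
    if should_recheck_on_restore(path, now) and device_wwid != path.udev_wwid:
        return "remove_and_readd"   # wwid changed: treat as a new path
    return "reinstate"              # unchanged, or recheck not due
```

The sketch leaves out the add-event and cli_add_path() cases, where the
description says the wwid is rechecked regardless of how long the path was
down.
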
> > > Co-developed-by: Chongyun Wu <wucy11@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > 
> > I am uncertain about this.
> > 
> > We get one more configuration option that defaults to off and that only
> > > the truly initiated will understand and use. And even those will not
> > know how to set the recheck time. Should it be 1s, 10, or 100? We
> > already have too many of these options in multipath-tools. We shy away
> > from giving users reasonable defaults, with the result that most people
> > won't bother.
> > 
> > I generally don't understand what the UP/DOWN state has to do with
> > this. If the WWID can change without any events seen by either the
> > kernel or user space, why would the path go down and up again? And even
> > if so, why would it matter how long the device remained down?
> > 
> > But foremost, do we really have to try to deal with configuration
> > mistakes as blatant as this? What if a user sets the same WWID for
> > different devices, or re-uses the same WWID on different storage
> > servers? I already hesitated about the code I added myself for catching
> > user errors in the WWIDs file, but this seems even more far-fetched.
> > 
> > Please convince me.
> > 
> > This said, I'd like to understand why there are no events in these
> > cases. I'd have thought we'd at least get a UNIT ATTENTION (REPORTED
> > LUNS DATA HAS CHANGED), which would have caused a uevent. If there was
> > no UNIT ATTENTION, I'd blame the storage side. 
> 
> Yeah, just for reference, I saw this happening in practice when
> something with the LU mapping changed on IBM storage - IIRC I saw it
> with capacity changes. You end up in this code in the kernel:
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/scsi_error.c?id=92bf22614b21a2706f4993b278017e437f7785b3#n416
> 
> > And from there you ought to get a uevent for the sdev.
> 
> The WWID in sysfs might still be wrong though AFAIK. The kernel seems to
> ignore the UA after it delivered the uevent.
> 

Sorry, I replied with the wrong mail address.

-- 
Best Regards, Benjamin Block  / Linux on IBM Z Kernel Development / IBM Systems
IBM Deutschland Research & Development GmbH    /    https://www.ibm.com/privacy
Vorsitz. AufsR.: Gregor Pillen         /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
