On Wed, Feb 10, 2021 at 07:09:31PM +0100, Benjamin Block wrote:
> On Tue, Feb 09, 2021 at 10:19:45PM +0000, Martin Wilck wrote:
> > On Mon, 2021-02-08 at 23:19 -0600, Benjamin Marzinski wrote:
> > > There are cases where the wwid of a path changes due to LUN remapping
> > > without triggering a uevent for the changed path. Multipathd has no
> > > method for trying to catch these cases, and corruption has resulted
> > > because of it.
> > >
> > > In order to have a better chance at catching these cases, multipath
> > > now has a recheck_wwid_time option, which can either be set to "off"
> > > or a number of seconds. If a path is failed for equal to or greater
> > > than the configured number of seconds, multipathd will recheck its
> > > wwid before restoring it, when the path checker sees that it has come
> > > back up.
> >
> > Can't the WWID change also happen without the path going offline, or
> > at least without being offline long enough that multipathd would
> > notice?
> >
> > > If multipathd notices that a path's wwid has changed it will remove
> > > and re-add the path, just like the existing wwid checking code for
> > > change events does. In cases where no uevent occurs, both the udev
> > > database entry and sysfs will have the old wwid, so the only way to
> > > get a current wwid is to ask the device directly.
> >
> > sysfs is wrong too, really? In that case I fear triggering an uevent
> > won't fix the situation. You need to force the kernel to rescan the
> > device, otherwise udev will fetch the WWID from sysfs again, which
> > still has the wrong ID... or what am I missing here?
> >
> > > Currently multipath only has code to directly get the wwid for scsi
> > > devices, so this option only affects scsi devices.
> > > Also, since it's possible that the udev wwid won't match the wwid
> > > from get_vpd_sgio(), if multipathd doesn't initially see the two
> > > values matching for a device, it will disable this option for that
> > > device.
> > >
> > > If recheck_wwid_time is not turned off, multipathd will also
> > > automatically recheck the wwid whenever an existing path gets an add
> > > event, or is manually re-added with cli_add_path().
> > >
> > > Co-developed-by: Chongyun Wu <wucy11@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> >
> > I am uncertain about this.
> >
> > We get one more configuration option that defaults to off and that only
> > the truly inaugurated will understand and use. And even those will not
> > know how to set the recheck time. Should it be 1s, 10, or 100? We
> > already have too many of these options in multipath-tools. We shy away
> > from giving users reasonable defaults, with the result that most people
> > won't bother.
> >
> > I generally don't understand what the UP/DOWN state has to do with
> > this. If the WWID can change without any events seen by either the
> > kernel or user space, why would the path go down and up again? And even
> > if so, why would it matter how long the device remained down?
> >
> > But foremost, do we really have to try to deal with configuration
> > mistakes as blatant as this? What if a user sets the same WWID for
> > different devices, or re-uses the same WWID on different storage
> > servers? I already hesitated about the code I added myself for catching
> > user errors in the WWIDs file, but this seems even more far-fetched.
> >
> > Please convince me.
> >
> > This said, I'd like to understand why there are no events in these
> > cases. I'd have thought we'd at least get a UNIT ATTENTION (REPORTED
> > LUNS DATA HAS CHANGED), which would have caused a uevent. If there was
> > no UNIT ATTENTION, I'd blame the storage side.
>
> Yeah, just for reference, I saw this happening in practice when
> something with the LU mapping changed on IBM storage - IIRC I saw it
> with capacity changes. You end up in this code in the kernel:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/scsi_error.c?id=92bf22614b21a2706f4993b278017e437f7785b3#n416
>
> And from there you ought to get an uevent for the sdev.
>
> The WWID in sysfs might still be wrong though AFAIK. The kernel seems to
> ignore the UA after it delivered the uevent.
>

Sorry, I replied with the wrong mail address.

--
Best Regards, Benjamin Block / Linux on IBM Z Kernel Development / IBM Systems
IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy
Vorsitz. AufsR.: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294