On Tue, 2021-04-27 at 16:41 -0400, Ewan D. Milne wrote:
> On Tue, 2021-04-27 at 20:33 +0000, Martin Wilck wrote:
> > On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> > >
> > > There's no way to do that, in principle. Because there could be
> > > other I/Os in flight. You might (somehow) avoid retrying an I/O
> > > that got a UA until you figured out if something changed, but other
> > > I/Os can already have been sent to the target, or issued before you
> > > get to look at the status.
> >
> > Right. But in practice, a WWID change will hardly happen under full
> > IO load. The storage side will probably have to block IO while this
> > happens, at least for a short time period. So blocking and quiescing
> > the queue upon an UA might still work, most of the time. Even if we
> > were too late already, the sooner we stop the queue, the better.
> >
> > The current algorithm in multipath-tools needs to detect a path going
> > down and being reinstated. The time interval during which a WWID
> > change will go unnoticed is one or more path checker intervals,
> > typically on the order of 5-30 seconds. If we could decrease this
> > interval to a sub-second or even millisecond range by blocking the
> > queue in the kernel quickly, we'd have made a big step forward.
>
> Yes, and in many situations this may help. But in the general case
> we can't protect against a storage array misconfiguration,
> where something like this can happen. So I worry about people
> believing the host software will protect them against a mistake,
> when we can't really do that.

I agree.
I expressed a similar notion in the following thread about multipathd's
WWID change detection capabilities in the face of really bad mistakes on
the administrator's (or storage array's, for that matter) part:

https://listman.redhat.com/archives/dm-devel/2021-February/msg00248.html

But others stressed that we should nonetheless try our best to avoid
customer data corruption (which I agree with, too), and thus we settled
on the current algorithm, which suited the needs of at least the
affected user(s) in that specific case.

Personally, I think that the current "5-30s" time period for WWID change
detection in multipathd is unsafe both theoretically and practically,
and may lure users into a false sense of safety. Therefore I'd strongly
welcome a kernel-side solution that might still not be safe
theoretically, but would cover most practical problem scenarios much
better than we currently do.

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer