On Fri, Jul 16, 2021 at 1:25 AM Martin Wilck <mwilck@xxxxxxxx> wrote:
>
> Hannes,
>
> On Fr, 2021-07-16 at 08:27 +0200, Hannes Reinecke wrote:
> > On 7/15/21 6:57 PM, Brian Bunker wrote:
> > > When paths return an ALUA state transition, do not fail those paths.
> > > The expectation is that the transition is short lived until the new
> > > ALUA state is entered. There might not be other paths in an online
> > > state to serve the request, which can lead to an unexpected I/O error
> > > on the multipath device.
> > >
> > > Signed-off-by: Brian Bunker <brian@xxxxxxxxxxxxxxx>
> > > Acked-by: Krishna Kant <krishna.kant@xxxxxxxxxxxxxxx>
> > > Acked-by: Seamus Connor <sconnor@xxxxxxxxxxxxxxx>
> > > --
> > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > > index bced42f082b0..28948cc481f9 100644
> > > --- a/drivers/md/dm-mpath.c
> > > +++ b/drivers/md/dm-mpath.c
> > > @@ -1652,12 +1652,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> > >         if (error && blk_path_error(error)) {
> > >                 struct multipath *m = ti->private;
> > >
> > > -               if (error == BLK_STS_RESOURCE)
> > > +               if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
> > >                         r = DM_ENDIO_DELAY_REQUEUE;
> > >                 else
> > >                         r = DM_ENDIO_REQUEUE;
> > >
> > > -               if (pgpath)
> > > +               if (pgpath && (error != BLK_STS_AGAIN))
> > >                         fail_path(pgpath);
> > >
> > >                 if (!atomic_read(&m->nr_valid_paths) &&
> > > --
> >
> > Sorry, but this will lead to regressions during failover for arrays
> > taking a longer time (some take up to 30 minutes for a complete
> > failover). And for those it's absolutely crucial to _not_ retry I/O
> > on the paths in transitioning.
>
> This won't happen.
>
> As argued in https://marc.info/?l=linux-scsi&m=162625194203635&w=2,
> your previous patches avoid the request being requeued on the SCSI
> queue, even with Brian's patch on top. IMO that means that the deadlock
> situation analyzed earlier can't occur. If you disagree, please explain
> in detail.
>
> By not failing the path, the request can be requeued on the dm level,
> and dm can decide to try the transitioning path again. But the request
> wouldn't be added to the SCSI queue, because alua_prep_fn() would
> prevent that.
>
> So, in the worst case, dm would retry queueing the request on the
> transitioning path over and over again. By adding a delay, we avoid
> busy-looping. This has basically the same effect as queueing on the dm
> layer in the first place: the request stays queued on the dm level most
> of the time, except that the queueing would stop earlier: as soon as
> the kernel notices that the path is no longer transitioning. By not
> failing the dm paths, we don't depend on user space to reinstate them,
> which is the right thing to do for a transitioning state IMO.
>
> > And you already admitted that 'queue_if_no_path' would resolve this
> > problem, so why not update the device configuration in multipath-tools
> > to have the correct setting for your array?

The reason I don't like queue_if_no_path as a solution is that there are
times when we do want to fail all of the paths as quickly as possible
(e.g. a cluster sharing a resource). If we add a non-zero value to
no_path_retry, we would force that configuration to wait unnecessarily,
and those customers would see the delay as a regression. This is where a
distinction between the ALUA standby or unavailable states and the
transitioning ALUA state is attractive.
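For reference, I read that suggestion as carrying a device override roughly
like the following in multipath.conf (a sketch only; the vendor/product
strings and the retry count are illustrative, not a recommendation):

    devices {
        device {
            vendor  "PURE"
            product "FlashArray"
            # Queue I/O at the dm level instead of failing it when no
            # usable path is left, either indefinitely ...
            no_path_retry queue
            # ... or for a bounded number of path-checker intervals:
            # no_path_retry 12
        }
    }

With either variant, the cluster case above has to wait for the queueing
to end (or for manual intervention) before the I/O error is surfaced,
which is exactly the delay I'd like to avoid.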
>
> I think Brian is right that setting transitioning paths to failed state
> in dm is highly questionable.
>
> So far, in dm-multipath, we haven't set paths to failed state because
> of ALUA state transitions. We've been mapping ALUA states to priorities
> instead. We don't even fail paths in ALUA "unavailable" state, so why
> do it for "transitioning"?
>
> Thanks
> Martin
>

Thanks,
Brian
--
Brian Bunker
PURE Storage, Inc.
brian@xxxxxxxxxxxxxxx