Hannes,

On Fri, 2021-07-16 at 08:27 +0200, Hannes Reinecke wrote:
> On 7/15/21 6:57 PM, Brian Bunker wrote:
> > When paths return an ALUA state transition, do not fail those
> > paths. The expectation is that the transition is short lived until
> > the new ALUA state is entered. There might not be other paths in an
> > online state to serve the request, which can lead to an unexpected
> > I/O error on the multipath device.
> >
> > Signed-off-by: Brian Bunker <brian@xxxxxxxxxxxxxxx>
> > Acked-by: Krishna Kant <krishna.kant@xxxxxxxxxxxxxxx>
> > Acked-by: Seamus Connor <sconnor@xxxxxxxxxxxxxxx>
> > --
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index bced42f082b0..28948cc481f9 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -1652,12 +1652,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >  	if (error && blk_path_error(error)) {
> >  		struct multipath *m = ti->private;
> >  
> > -		if (error == BLK_STS_RESOURCE)
> > +		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
> >  			r = DM_ENDIO_DELAY_REQUEUE;
> >  		else
> >  			r = DM_ENDIO_REQUEUE;
> >  
> > -		if (pgpath)
> > +		if (pgpath && (error != BLK_STS_AGAIN))
> >  			fail_path(pgpath);
> >  
> >  		if (!atomic_read(&m->nr_valid_paths) &&
> > --
>
> Sorry, but this will lead to regressions during failover for arrays
> taking longer time (some take up to 30 minutes for a complete
> failover). And for those it's absolutely crucial to _not_ retry I/O
> on the paths in transitioning.

This won't happen. As argued in
https://marc.info/?l=linux-scsi&m=162625194203635&w=2, your previous
patches prevent the request from being requeued on the SCSI queue,
even with Brian's patch on top. IMO that means that the deadlock
situation analyzed earlier can't occur. If you disagree, please
explain in detail.

By not failing the path, the request can be requeued at the dm level,
and dm can decide to try the transitioning path again. But the request
won't be added to the SCSI queue, because alua_prep_fn() prevents
that. So, in the worst case, dm would retry queueing the request on
the transitioning path over and over again; by adding a delay, we
avoid busy-looping. This has basically the same effect as queueing at
the dm layer in the first place: the request stays queued at the dm
level most of the time. The only difference is that the queueing stops
earlier, as soon as the kernel notices that the path is no longer
transitioning.

By not failing the dm paths, we also don't depend on user space to
reinstate them, which IMO is the right thing to do for a transitioning
state.

> And you already admitted that 'queue_if_no_path' would resolve this
> problem, so why not update the device configuration in
> multipath-tools to have the correct setting for your array?

I think Brian is right that setting transitioning paths to failed
state in dm is highly questionable. So far, dm-multipath has never set
paths to failed state because of ALUA state transitions; we've been
mapping ALUA states to priorities instead. We don't even fail paths in
the ALUA "unavailable" state, so why do it for "transitioning"?

Thanks,
Martin
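
P.S.: To make the alua_prep_fn() argument above concrete, here is a
sketch (written from memory and simplified, so the actual code in
drivers/scsi/device_handler/scsi_dh_alua.c may differ in detail) of
how a transitioning port group keeps a request off the SCSI queue:

static blk_status_t alua_prep_fn(struct scsi_device *sdev,
				 struct request *req)
{
	struct alua_dh_data *h = sdev->handler_data;
	struct alua_port_group *pg;
	unsigned char state = SCSI_ACCESS_STATE_OPTIMAL;

	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg)
		state = pg->state;
	rcu_read_unlock();

	switch (state) {
	case SCSI_ACCESS_STATE_OPTIMAL:
	case SCSI_ACCESS_STATE_ACTIVE:
	case SCSI_ACCESS_STATE_LBA:
		/* Usable path: let the request proceed to the device. */
		return BLK_STS_OK;
	case SCSI_ACCESS_STATE_TRANSITIONING:
		/*
		 * Transitioning path: refuse to queue. With Brian's
		 * patch, dm-mpath maps BLK_STS_AGAIN to
		 * DM_ENDIO_DELAY_REQUEUE without failing the path, so
		 * the request waits at the dm level instead.
		 */
		return BLK_STS_AGAIN;
	default:
		req->rq_flags |= RQF_QUIET;
		return BLK_STS_IOERR;
	}
}

IOW, as long as the path is transitioning, retries never reach the
SCSI layer; they just bounce between dm-mpath and this prep function,
with a delay in between.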