On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote:
> On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@xxxxxxxx> wrote:
> > Are you saying that on your server, the transitioning ports are able
> > to process regular I/O commands like READ and WRITE? If that's the
> > case, why do you pretend to be "transitioning" at all, rather than
> > in an active state? If it's not the case, why would it make sense
> > for the host to retry I/O on the transitioning path?
>
> In the ALUA transitioning state, we cannot process READ or WRITE and
> will return with the sense data as you mentioned above. We expect
> retries down that transitioning path until it transitions to another
> ALUA state (at least for some reasonable period of time for the
> transition). The spec defines the state as the transition between
> target asymmetric states. The current implementation requires
> coordination on the target not to return a state transition down all
> paths at the same time or risk all paths being failed. Using the ALUA
> transition state allows us to respond to initiator READ and WRITE
> requests even if we can't serve them when our internal target state
> is transitioning (secondary to primary). The alternative is to queue
> them, which presents a different set of problems.

Indeed, it would be less prone to host-side errors if the "new" path
group reached a usable state before the "old" path group became
unavailable. Granted, this may be difficult to guarantee on the
storage side.

> > > If the paths are failed which are transitioning, an all paths
> > > down state happens which is not expected.
> >
> > IMO it _is_ expected if in fact no path is able to process SCSI
> > commands at the given point in time.
>
> In this case it would seem having all paths move to transitioning
> would lead to all paths lost. It is possible to imagine
> implementations where for a brief period of time all paths are in a
> transitioning state. What would be the point of returning a transient
> state if the result is a permanent failure?

When a command fails with ALUA TRANSITIONING status, we must make sure
that:

1) The command itself is not retried on the path at hand, neither on
   the SCSI layer nor on the blk-mq layer. The former was the case
   anyway; the latter is guaranteed by 0d88232010d5 ("scsi: core:
   Return BLK_STS_AGAIN for ALUA transitioning").

2) No new commands are sent down this path until it reaches a usable
   final state. This is achieved on the SCSI layer by alua_prep_fn(),
   since 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for
   ALUA transitioning state").

These two items would still hold with your patch applied. The problem,
however, is that if the path isn't failed, dm-multipath will continue
sending I/O down it. If the path isn't set to the failed state, the
path selector algorithm may or may not choose a different path next
time. In the worst case, dm-multipath would busy-loop retrying the I/O
on the same path.
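For reference, the transitioning handling that 268940b80fa4 added to
alua_prep_fn() looks roughly like the sketch below (paraphrased from
drivers/scsi/device_handler/scsi_dh_alua.c, not a verbatim copy):

static blk_status_t alua_prep_fn(struct scsi_device *sdev, struct request *req)
{
	struct alua_dh_data *h = sdev->handler_data;
	struct alua_port_group *pg;
	unsigned char state = SCSI_ACCESS_STATE_OPTIMAL;

	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg)
		state = pg->state;
	rcu_read_unlock();

	switch (state) {
	case SCSI_ACCESS_STATE_OPTIMAL:
	case SCSI_ACCESS_STATE_ACTIVE:
	case SCSI_ACCESS_STATE_LBA:
		return BLK_STS_OK;
	case SCSI_ACCESS_STATE_TRANSITIONING:
		/* reject new I/O on the transitioning path; dm-multipath
		 * sees BLK_STS_AGAIN in its end_io handler */
		return BLK_STS_AGAIN;
	default:
		req->rq_flags |= RQF_QUIET;
		return BLK_STS_IOERR;
	}
}

So no new I/O is dispatched on the transitioning path, but the status
bubbles up to multipath_end_io(), which is where your patch comes in.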
Please consider the following:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86518aec32b4..3f3a89fc2b3b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		if (error == BLK_STS_RESOURCE)
+		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
 			r = DM_ENDIO_DELAY_REQUEUE;
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);

This way we'd avoid busy-looping by delaying the retry. It would cause
I/O delay in the case where some healthy paths are still in the same
dm-multipath priority group as the transitioning path. I suppose this
is a minor problem, because in the default case for ALUA
(group_by_prio in multipathd), all paths in the PG would have switched
to "transitioning" simultaneously.

Note that multipathd assigns priority 0 (the same prio as
"unavailable") if it happens to notice a "transitioning" path. This is
something we may want to change eventually. In your specific case, it
would cause the paths to be temporarily re-grouped, with all paths
moved to the same non-functional PG. The way you describe it, for your
storage at least, "transitioning" should be assigned a higher priority.

> > If you use a suitable "no_path_retry" setting in multipathd, you
> > should be able to handle the situation you describe just fine by
> > queueing the I/O until the transitioning paths are fully up. IIUC,
> > on your server "transitioning" is a transient state that ends
> > quickly, so queueing shouldn't be an issue. E.g. if you are certain
> > that "transitioning" won't last longer than 10s, you could set
> > "no_path_retry 2".
>
> I have tested using the no_path_retry and you are correct that it
> does work around the issue that I am seeing. The problem with that is
> that there are times we want to convey all paths down to the
> initiator as quickly as possible, and doing this will delay that.

OK, that makes sense, e.g. for cluster configurations. So the purpose
is to distinguish between two cases where no path can process SCSI
commands: a) all paths are really down on the storage, and b) some
paths are down while others are "coming up".

Thanks,
Martin
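PS: just to spell out the workaround I mentioned above, a minimal
multipath.conf sketch (values are examples only; with the default
polling_interval of 5s, "no_path_retry 2" means roughly 10s of
queueing after the last path fails):

defaults {
	# path checker interval in seconds (this is the default anyway)
	polling_interval 5
	# queue I/O for ~2 checker intervals (~10s) after all paths
	# have failed, then report the error upwards
	no_path_retry 2
}

The same setting can of course go into a devices/device section for
your specific array instead of defaults.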