Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning

Martin Wilck <mwilck@xxxxxxxx> · Tue, 13 Jul 2021 11:13:09 +0200

Hello Brian,

On Mon, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> Martin,
> 
> > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > scsi_dh_alua: Retry RTPG on a different path after failure").
> > That commit should cause the RTPG to be retried on other map
> > members which are not in failed state, thus avoiding this
> > phenomenon.
> 
> In my case, there are no other map members that are not in the failed
> state. One set of paths goes to the ALUA unavailable state when the
> primary fails, and the second set of paths moves to ALUA state
> transitioning as the previous secondary becomes the primary.

IMO this is the problem. How does your array respond to SCSI commands
while ports are transitioning? 

SPC-5 (§5.16.2.6) says that the server should either (a) fail all
commands with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (CC/NR/AAST for short),
or (b) support all TMFs and a subset of SCSI commands, while responding
with CC/NR/AAST to all other commands. SPC-6 (§5.18.2.6) is no
different.
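
To make the CC/NR/AAST response concrete, here is a minimal Python
sketch of how a host could recognise it from the sense data. The sense
key (NOT READY, 02h) and ASC/ASCQ (04h/0Ah) are the standard values for
this condition; the function name is hypothetical, and this is an
illustration, not the kernel's code.

```python
# Sense data for CHECK CONDITION / NOT READY / LOGICAL UNIT NOT
# ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION.
SENSE_KEY_NOT_READY = 0x02
ASC_LUN_NOT_ACCESSIBLE = 0x04
ASCQ_ALUA_STATE_TRANSITION = 0x0A


def is_alua_transitioning(sense_key, asc, ascq):
    """Return True if the sense triple reports an ALUA state transition.

    Hypothetical helper for illustration only.
    """
    return (sense_key, asc, ascq) == (
        SENSE_KEY_NOT_READY,
        ASC_LUN_NOT_ACCESSIBLE,
        ASCQ_ALUA_STATE_TRANSITION,
    )
```

A command that comes back with this triple was not processed by the
port, whichever of the two options the array implements.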

No matter how you read that paragraph, it's pretty clear that
"transitioning" is generally not a healthy state in which to attempt
I/O.

Are you saying that on your server, the transitioning ports are able to
process regular I/O commands like READ and WRITE? If that's the case,
why do you pretend to be "transitioning" at all, rather than in an
active state? If it's not the case, why would it make sense for the
host to retry I/O on the transitioning path?

>  If the
> paths are failed which are transitioning, an all paths down state
> happens which is not expected.

IMO it _is_ expected if in fact no path is able to process SCSI
commands at the given point in time.

>  There should be a time for which
> transitioning is a transient state until the next state is entered.
> Failing a path assuming there would be non-failed paths seems wrong.

This is a misunderstanding. The path isn't failed because of
assumptions about other paths. It is failed because we know that it's
non-functional, and thus we must try other paths, if there are any.

Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state"), I/O was indeed retried on transitioning paths,
possibly forever. This posed a serious problem when a transitioning
path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
quite convinced that it was wrong in general, because by any reasonable
measure a "transitioning" path isn't usable for the host.

If we find a transitioning path, it might make sense to retry on other
paths first and eventually switch back to the transitioning path, when
all others have failed hard (e.g. "unavailable" state). However, this
logic doesn't exist in the kernel. In theory, it could be mapped to a
"transitioning" priority group in device-mapper multipath. But prio
groups are managed in user space (multipathd), which treats
transitioning paths as "almost failed" (priority 0) anyway. We can
discuss enhancing multipathd such that it re-checks transitioning paths
more frequently, in order to be able to reinstate them ASAP.
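
The fallback idea described above could be sketched roughly as follows.
This is purely hypothetical Python to illustrate the ordering; neither
the kernel nor multipathd implements it, and the state names and the
pick_path() helper are made up for this example.

```python
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"
    TRANSITIONING = "transitioning"
    FAILED = "failed"  # hard failure, e.g. ALUA "unavailable"


def pick_path(paths):
    """Pick a path name from a list of (name, PathState) tuples.

    Usable paths are preferred. A transitioning path is chosen only
    when every other path has failed hard. Returns None if nothing
    is left to try.
    """
    for wanted in (PathState.ACTIVE, PathState.TRANSITIONING):
        for name, state in paths:
            if state is wanted:
                return name
    return None
```

With one set of paths "unavailable" and the other "transitioning",
pick_path() would return a transitioning path rather than reporting all
paths down. Whether I/O on that path can succeed still depends on how
the array behaves while transitioning.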

According to what you said above, the "transitioning" ports in the
problem situation ("second set") are those that were in "unavailable"
state before, which means "failed" as far as device mapper is concerned
- IOW, the paths in question would be unused anyway, until they got
reinstated, which wouldn't happen before they are fully up. With this
in mind, I have to say I don't understand why your proposed patch would
help at all. Please explain.

> > The purpose of that patch was to set the state of the transitioning
> > path to failed in order to make sure IO is retried on a different
> > path.
> > Your patch would undermine this purpose.

(Additional indentation added by me) Can you please use proper quoting?
You were mixing my statements and your own. 

> I agree this is what happens but those transitioning paths might be
> the only non-failed paths available. I don't think it is reasonable
> to fail them. This is the same as treating transitioning as standby
> or unavailable.

Right, and according to the SPC spec (see above), that makes more sense
than treating it as "active".

Storage vendors seem to interpret "transitioning" very differently,
both in terms of commands supported and in terms of time required to
reach the target state. That makes it hard to deal with it correctly on
the host side.

>  As you point out this happened with the commit you
> mention. Before this commit what I am doing does not result in an all
> paths down error, and similarly, it does not in earlier Linux
> versions or other OS's under the same condition. I see this as a
> regression.

If you use a suitable "no_path_retry" setting in multipathd, you should
be able to handle the situation you describe just fine by queueing the
I/O until the transitioning paths are fully up. IIUC, on your server
"transitioning" is a transient state that ends quickly, so queueing
shouldn't be an issue. E.g. if you are certain that "transitioning"
won't last longer than 10s, you could set "no_path_retry 2".
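
The arithmetic behind that number, as a rough Python sketch: a positive
no_path_retry is counted in path checker intervals, so the queueing
time is approximately no_path_retry times polling_interval. The 5s
default for polling_interval is assumed here, and both function names
are hypothetical. Treat the result as an estimate, since the effective
checker interval can differ from the configured value.

```python
def queue_seconds(no_path_retry, polling_interval=5):
    """Approximate time (s) I/O stays queued after all paths are down."""
    return no_path_retry * polling_interval


def min_no_path_retry(max_transition_seconds, polling_interval=5):
    """Smallest no_path_retry that covers the given transition time."""
    # Ceiling division without importing math.
    return -(-max_transition_seconds // polling_interval)
```

For the example above, min_no_path_retry(10) yields 2. A transition
that could take up to 11s would need 3 under the same assumption.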

Regards,
Martin