Hello Brian, On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote: > Martin, > > > Please confirm that your kernel included ee8868c5c78f ("scsi: > > scsi_dh_alua: Retry RTPG on a different path after failure"). > > That commit should cause the RTPG to be retried on other map > > members > > which are not in failed state, thus avoiding this phenomenon. > > In my case, there are no other map members that are not in the failed > state. One set of paths goes to the ALUA unavailable state when the > primary fails, and the second set of paths moves to ALUA state > transitioning as the previous secondary becomes the primary. IMO this is the problem. How does your array respond to SCSI commands while ports are transitioning? SPC5 (§5.16.2.6) says that the server should either fail all commands with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support all TMFs and a subset of SCSI commands, while responding with CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different. No matter how you read that paragraph, it's pretty clear that "transitioning" is generally not a healthy state to attempt I/O. Are you saying that on your server, the transitioning ports are able to process regular I/O commands like READ and WRITE? If that's the case, why do you pretend to be "transitioning" at all, rather than in an active state? If it's not the case, why would it make sense for the host to retry I/O on the transitioning path? > If the > paths are failed which are transitioning, an all paths down state > happens which is not expected. IMO it _is_ expected if in fact no path is able to process SCSI commands at the given point in time. > There should be a time for which > transitioning is a transient state until the next state is entered. > Failing a path assuming there would be non-failed paths seems wrong. This is a misunderstanding. The path isn't failed because of assumptions about other paths. It is failed because we know that it's non-functional, and thus we must try other paths, if there are any. Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA transitioning state"), I/O was indeed retried on transitioning paths, possibly forever. This posed a serious problem when a transitioning path was about to be removed (e.g. dev_loss_tmo expired). And I'm still quite convinced that it was wrong in general, because by all reasonable means a "transitioning" path isn't usable for the host. If we find a transitioning path, it might make sense to retry on other paths first and eventually switch back to the transitioning path, when all others have failed hard (e.g. "unavailable" state). However, this logic doesn't exist in the kernel. In theory, it could be mapped to a "transitioning" priority group in device-mapper multipath. But prio groups are managed in user space (multipathd), which treats transitioning paths as "almost failed" (priority 0) anyway. We can discuss enhancing multipathd such that it re-checks transitioning paths more frequently, in order to be able to reinstate them ASAP. According to what you said above, the "transitioning" ports in the problem situation ("second set") are those that were in "unavailable" state before, which means "failed" as far as device mapper is concerned - IOW, the paths in question would be unused anyway, until they got reinstated, which wouldn't happen before they are fully up. With this in mind, I have to say I don't understand why your proposed patch would help at all. Please explain. > > The purpose of that patch was to set the state of the transitioning > > path to failed in order to make sure IO is retried on a different > > path. > > Your patch would undermine this purpose. (Additional indentation added by me) Can you please use proper quoting? You were mixing my statements and your own. > I agree this is what happens but those transitioning paths might be > the only non-failed paths available. I don't think it is reasonable > to > fail them. This is the same as treating transitioning as standby or > unavailable. Right, and according to the SPC spec (see above), that makes more sense than treating it as "active". Storage vendors seem to interpret "transitioning" very differently, both in terms of commands supported and in terms of time required to reach the target state. That makes it hard to deal with it correctly on the host side. > As you point out this happened with the commit you > mention. Before this commit what I am doing does not result in an all > paths down error, and similarly, it does not in earlier Linux > versions > or other OS's under the same condition. I see this as a regression. If you use a suitable "no_path_retry" setting in multipathd, you should be able to handle the situation you describe just fine by queueing the I/O until the transitioning paths are fully up. IIUC, on your server "transitioning" is a transient state that ends quickly, so queueing shouldn't be an issue. E.g. if you are certain that "transitioning" won't last longer than 10s, you could set "no_path_retry 2". Regards, Martin