On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote:
> On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@xxxxxxxx> wrote:
> > Are you saying that on your server, the transitioning ports are able
> > to process regular I/O commands like READ and WRITE? If that's the
> > case, why do you pretend to be "transitioning" at all, rather than
> > in an active state? If it's not the case, why would it make sense
> > for the host to retry I/O on the transitioning path?
>
> In the ALUA transitioning state, we cannot process READ or WRITE and
> will return with the sense data as you mentioned above. We expect
> retries down that transitioning path until it transitions to another
> ALUA state (at least for some reasonable period of time for the
> transition). The spec defines the state as the transition between
> target asymmetric states. The current implementation requires
> coordination on the target not to return a state transition down all
> paths at the same time or risk all paths being failed. Using the ALUA
> transition state allows us to respond to initiator READ and WRITE
> requests even if we can't serve them when our internal target state
> is transitioning (secondary to primary). The alternative is to queue
> them, which presents a different set of problems.

Indeed, it would be less prone to host-side errors if the "new" path
group reached a usable state before the "old" path group became
unavailable. Granted, this may be difficult to guarantee on the
storage side.

> > > If the paths are failed which are transitioning, an all paths
> > > down state happens which is not expected.
> >
> > IMO it _is_ expected if in fact no path is able to process SCSI
> > commands at the given point in time.
>
> In this case it would seem having all paths move to transitioning
> would lead to all paths lost. It is possible to imagine
> implementations where for a brief period of time all paths are in a
> transitioning state. What would be the point of returning a transient
> state if the result is a permanent failure?

When a command fails with ALUA TRANSITIONING status, we must make sure
that:

1) The command itself is not retried on the path at hand, neither on
   the SCSI layer nor on the blk-mq layer. The former was the case
   anyway; the latter is guaranteed by 0d88232010d5 ("scsi: core:
   Return BLK_STS_AGAIN for ALUA transitioning").

2) No new commands are sent down this path until it reaches a usable
   final state. This is achieved on the SCSI layer by alua_prep_fn(),
   since 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for
   ALUA transitioning state").

These two items would still hold with your patch applied. The problem,
however, is that if the path isn't failed, dm-multipath will continue
sending I/O down it. If the path isn't set to the failed state, the
path selector algorithm may or may not choose a different path next
time. In the worst case, dm-multipath would busy-loop retrying the I/O
on the same path.
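For reference, the transitioning handling that 268940b80fa4 added to
alua_prep_fn() looks roughly like the sketch below (paraphrased from
drivers/scsi/device_handler/scsi_dh_alua.c, not a verbatim copy):

static blk_status_t alua_prep_fn(struct scsi_device *sdev, struct request *req)
{
	struct alua_dh_data *h = sdev->handler_data;
	struct alua_port_group *pg;
	unsigned char state = SCSI_ACCESS_STATE_OPTIMAL;

	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg)
		state = pg->state;
	rcu_read_unlock();

	switch (state) {
	case SCSI_ACCESS_STATE_OPTIMAL:
	case SCSI_ACCESS_STATE_ACTIVE:
	case SCSI_ACCESS_STATE_LBA:
		return BLK_STS_OK;
	case SCSI_ACCESS_STATE_TRANSITIONING:
		/* reject new I/O on the transitioning path; dm-multipath
		 * sees BLK_STS_AGAIN in its end_io handler */
		return BLK_STS_AGAIN;
	default:
		req->rq_flags |= RQF_QUIET;
		return BLK_STS_IOERR;
	}
}

So no new I/O is dispatched on the transitioning path, but the status
bubbles up to multipath_end_io(), which is where your patch comes in.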
Please consider the following:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86518aec32b4..3f3a89fc2b3b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		if (error == BLK_STS_RESOURCE)
+		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
 			r = DM_ENDIO_DELAY_REQUEUE;
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);

This way we'd avoid busy-looping by delaying the retry. It would cause
I/O delay in the case where some healthy paths are still in the same
dm-multipath priority group as the transitioning path. I suppose this
is a minor problem, because in the default case for ALUA
(group_by_prio in multipathd), all paths in the PG would have switched
to "transitioning" simultaneously.

Note that multipathd assigns priority 0 (the same prio as
"unavailable") if it happens to notice a "transitioning" path. This is
something we may want to change eventually. In your specific case, it
would cause the paths to be temporarily re-grouped, with all paths
moved to the same non-functional PG. The way you describe it, for your
storage at least, "transitioning" should be assigned a higher priority.

> > If you use a suitable "no_path_retry" setting in multipathd, you
> > should be able to handle the situation you describe just fine by
> > queueing the I/O until the transitioning paths are fully up. IIUC,
> > on your server "transitioning" is a transient state that ends
> > quickly, so queueing shouldn't be an issue. E.g. if you are certain
> > that "transitioning" won't last longer than 10s, you could set
> > "no_path_retry 2".
>
> I have tested using the no_path_retry and you are correct that it
> does work around the issue that I am seeing. The problem with that is
> that there are times we want to convey all paths down to the
> initiator as quickly as possible, and doing this will delay that.

OK, that makes sense, e.g. for cluster configurations. So the purpose
is to distinguish between two cases where no path can process SCSI
commands: a) all paths are really down on the storage, and b) some
paths are down while others are "coming up".

Thanks,
Martin
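PS: just to spell out the workaround I mentioned above, a minimal
multipath.conf sketch (values are examples only; with the default
polling_interval of 5s, "no_path_retry 2" means roughly 10s of
queueing after the last path fails):

defaults {
	# path checker interval in seconds (this is the default anyway)
	polling_interval 5
	# queue I/O for ~2 checker intervals (~10s) after all paths
	# have failed, then report the error upwards
	no_path_retry 2
}

The same setting can of course go into a devices/device section for
your specific array instead of defaults.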