On 7/15/21 6:57 PM, Brian Bunker wrote:
When paths return an ALUA state transition, do not fail those paths.
The expectation is that the transition is short lived until the new
ALUA state is entered. There might not be other paths in an online
state to serve the request, which can lead to an unexpected I/O error
on the multipath device.

Signed-off-by: Brian Bunker <brian@xxxxxxxxxxxxxxx>
Acked-by: Krishna Kant <krishna.kant@xxxxxxxxxxxxxxx>
Acked-by: Seamus Connor <sconnor@xxxxxxxxxxxxxxx>
--
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..28948cc481f9 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1652,12 +1652,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		if (error == BLK_STS_RESOURCE)
+		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
 			r = DM_ENDIO_DELAY_REQUEUE;
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);
 
 		if (!atomic_read(&m->nr_valid_paths) &&
--
Sorry, but this will lead to regressions during failover for arrays that take a long time to fail over (some need up to 30 minutes for a complete failover). For those it is absolutely crucial to _not_ retry I/O on paths that are in the transitioning state.
And you already admitted that 'queue_if_no_path' would resolve this problem, so why not update the device configuration in multipath-tools to have the correct setting for your array?
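For illustration, such a per-array setting in multipath-tools would typically go into a `devices` stanza in `/etc/multipath.conf`; `no_path_retry queue` is the usual way to get `queue_if_no_path` behavior. The vendor/product strings below are placeholders, not the actual array's identifiers:

```
devices {
	device {
		vendor		"EXAMPLE"	# placeholder vendor string
		product		"ARRAY"		# placeholder product string
		no_path_retry	queue		# queue I/O instead of erroring
						# when no usable paths remain
	}
}
```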
Cheers,

Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@xxxxxxx                                +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer