Just a note or two:

> My proposal is to handle this in several stages:
>
> - path fails
>   -> Send out netlink event
>   -> start dev_loss_tmo and fast_fail_io timer
>   -> fast_fail_io timer triggers: Abort all outstanding I/O with
>      DID_TRANSPORT_DISRUPTED, return DID_TRANSPORT_FAILFAST for
>      any future I/O, and send out netlink event.
>   -> dev_loss_tmo timer triggers: Remove sdev and cleanup rport.
>      netlink event is sent implicitly by removing the sdev.
>
> Multipath would then interact with this sequence by:
>
> - Upon receiving 'path failed' event: mark path as 'ghost' or 'blocked',
>   ie no I/O is currently possible and will be queued (no path switch yet).
> - Upon receiving 'fast_fail_io' event: switch paths and resubmit queued I/Os
> - Upon receiving 'path removed' event: remove path from internal structures,
>   update multipath maps etc.

This makes perfect sense to me. Are we going to allow the end-user to
modify those timers (not sure that's a good idea...)?

> The time between 'path failed' and 'fast_fail_io triggers' would then be
> able to capture any jitter / intermittent failures. Between
> 'fast_fail_io triggers' and 'path removed' the path would be held in some
> sort of 'limbo' in case it comes back again, eg for maintenance/SP update
> etc. And we can even increase this one to rather long timespans (eg hours)
> to give the admin enough time for a manual intervention.
> I still like this proposal as it makes multipath interaction far cleaner.
> And we can do away with path checkers completely here.

All true, although I think the "long" timespans might be best measured
in minutes (say, a default of 5 minutes) and should be configurable. As a
rule, it probably isn't a good idea to leave a path dead for a very long
time, even if it's possible to do so. Maybe some sort of userland
override would be worthwhile for scheduled maintenance?

Regards,
Jerry

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
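
To make the quoted sequence a bit more concrete, here is a rough sketch of how the multipath side might react to the three events. It is a toy model in Python, not multipathd or kernel code: the event strings (`path_failed`, `fast_fail_io`, `path_removed`), the `Multipath` and `Path` classes, and the extra `path_restored` event for a path that comes back before fast_fail_io fires are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"
    BLOCKED = "blocked"  # 'path failed' seen: queue I/O, no path switch yet
    FAILED = "failed"    # 'fast_fail_io' seen: I/O goes to another path


@dataclass
class Path:
    name: str
    state: PathState = PathState.ACTIVE
    queued: list = field(default_factory=list)


class Multipath:
    """Toy model of the multipath side (all names are assumptions)."""

    def __init__(self, names):
        self.paths = {n: Path(n) for n in names}
        self.completed = []  # (path name, io) pairs, kept for inspection

    def _resubmit(self, io):
        # Send the I/O down the first path that is still active.
        for p in self.paths.values():
            if p.state is PathState.ACTIVE:
                self.completed.append((p.name, io))
                return
        raise RuntimeError("no active path left")

    def submit(self, name, io):
        path = self.paths[name]
        if path.state is PathState.ACTIVE:
            self.completed.append((name, io))
        elif path.state is PathState.BLOCKED:
            path.queued.append(io)  # hold I/O; the path may still recover
        else:
            self._resubmit(io)      # path already failed: use another one

    def on_event(self, name, event):
        path = self.paths[name]
        if event == "path_failed":
            path.state = PathState.BLOCKED
        elif event == "path_restored":
            # Assumed event: the path came back before fast_fail_io fired.
            path.state = PathState.ACTIVE
            queued, path.queued = path.queued, []
            for io in queued:
                self.completed.append((name, io))
        elif event == "fast_fail_io":
            # Switch paths and resubmit everything that was queued.
            path.state = PathState.FAILED
            queued, path.queued = path.queued, []
            for io in queued:
                self._resubmit(io)
        elif event == "path_removed":
            # dev_loss_tmo expired: forget the path entirely.
            del self.paths[name]
        else:
            raise ValueError(f"unknown event: {event}")
```

With two paths, an I/O submitted between `path_failed` and `fast_fail_io` sits in the queue of the blocked path and only moves to the other path once `fast_fail_io` arrives, which is the window meant to absorb jitter.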