On Mon, 2011-12-19 at 05:16 -0500, Bart Van Assche wrote:
> On Mon, Dec 19, 2011 at 12:50 AM, David Dillow <dillowda@xxxxxxxx> wrote:
> > On Thu, 2011-12-01 at 20:11 +0100, Bart Van Assche wrote:
> >> Add a time-based transport layer test such that fail-over in a multipath
> >> setup can happen quickly.
> >
> > Why should this be done in the kernel? multipathd already verifies all
> > paths to a SCSI device are up and that the device is reachable.
>
> I'm afraid it's impossible to make a transport layer check work
> reliably from user space. As an example, srp_reset_host() blocks the
> SCSI host before reconnecting. Before starting to attempt to
> reconnect, that action does block the SCSI host and hence also all
> transport layer checks issued from user space. I doubt it's possible
> to fix the resulting race between a transport layer reconnect issued
> from srp_reset_host() and a transport layer reconnect triggered from
> user space.

Part of the problem is introduced by allowing for permanent connections
rather than using the familiar dev_loss_tmo and fast_io_fail_tmo
parameters from other SCSI transports.

For instance, in the FC transport, rports are allowed to disappear for
up to dev_loss_tmo seconds before being removed from the SCSI device
tree. Until they have been gone for fast_io_fail_tmo seconds, they are
blocked (as is error handling, to prevent offlining devices). Once they
have been gone longer than fast_io_fail_tmo, they become unblocked and
new IO will be failed.

Now, the FC transport is probably a bit more complex than we want right
now, but following its lead (and the SAS transport's) should keep us in
sync with where the rest of the SCSI stack is headed.

As for reliability from user space, multipathd checks that the SCSI
initiator is not blocked before checking the liveness of the path;
blocked paths are assumed to be down.
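
To make those semantics concrete, here is a rough sketch in Python of
the timeout model described above, plus a checker that treats a blocked
path as down. It is a simplified illustration, not the FC transport's
or multipathd's actual code. The names PortState, port_state and
path_is_up are made up for this example, and treating a timeout as
expired at exactly its boundary value is an assumption.

```python
from enum import Enum


class PortState(Enum):
    ONLINE = "online"        # transport connection is up
    BLOCKED = "blocked"      # port gone; I/O and error handling are held off
    FAST_FAIL = "fast_fail"  # unblocked again; new I/O is failed immediately
    REMOVED = "removed"      # port removed from the SCSI device tree


def port_state(seconds_gone, fast_io_fail_tmo, dev_loss_tmo):
    """Classify a remote port that has been unreachable for seconds_gone.

    seconds_gone is None while the port is reachable. Both timeouts are
    in seconds (hypothetical simplified model; boundaries are assumed).
    """
    if seconds_gone is None:
        return PortState.ONLINE
    if seconds_gone >= dev_loss_tmo:
        return PortState.REMOVED
    if seconds_gone >= fast_io_fail_tmo:
        return PortState.FAST_FAIL
    return PortState.BLOCKED


def path_is_up(state, probe):
    """User-space style path check.

    Any path that is not online is assumed down without issuing I/O;
    only an online path gets a real liveness probe (e.g. a TUR).
    """
    if state is not PortState.ONLINE:
        return False
    return probe()
```

With fast_io_fail_tmo=5 and dev_loss_tmo=60 in this model, a port that
has been gone for 2 seconds is blocked, one gone for 10 seconds fails
new IO right away, and one gone for 60 seconds is removed. In none of
those cases does the checker need to issue a probe.
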
multipathd still seems to default to a 30 second timeout for the
TUR/directIO/etc. test check, and that doesn't currently look to be
configurable (fixable). These timeouts are independent of the SCSI
layer, and will mark a path down for new traffic without waiting for
the lower level timeout.

The transport layer reconnect from user space (I'm assuming you were
thinking ioctl or using the sg device) would be coordinated by the SCSI
midlayer, using calls to srp_reset_host(), so I think we avoid any race
condition. Or did you mean a manual reconnect attempt from manipulating
srp_host/X/reconnect_tmo via sysfs? In that case it is in our code, so
we can certainly avoid race conditions.

I still think this is already solved in user space, but the new
reconnect model you've implemented doesn't match up with the expected
semantics. It'd be better to match the rest of the SCSI stack for this.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office