Roland Dreier wrote:
> [CC'ing linux-scsi as well -- I think we'll get better insight from there]
>
> > The current SRP initiator code cannot work with several fail-over mechanisms.
> >
> > The current srp driver's behavior when a target off-line then online:
> > 1) The target is offline.
> > 2) the initiator tries to reconnect and fails
> > 3) The initiator calls srp_remove_work that removes the scsi_host.
> > 4) The target is back online.
> > 5) the user (or the ibsrpdm daemon) is expected to execute a new add_target.
> > 6) This creates a new scsi_host (with new names to the devices and new index in
> > the scsi_host directory in sysfs) for this target.
> >
> > Fail-over drivers (e.g., MPP that is used by Engenio and XVM that is used by
> > SGI) have problems with this behavior (item 3). They need the scsi_host to keep
> > exist and return errors in the meanwhile until the connection to the target
> > resumes.
>
> OK, but is this a valid assumption? What happens for iSCSI and/or iSER?

I do not see why the host has to remain constant for the above problem.
I can understand why it may be easier to program though. However, this
is not a requirement for other multipath drivers like dm-multipath or md
multipath, and I do not think you should rely on that type of behavior.

The short story is that I think we are moving to something similar to
what srp does very soon.

The long story....

iscsi and iser allocate a host per session (the session is allocated in
the host's hostdata). If there are problems with the connection (the
target is unreachable for N seconds, we get an error from the network
layer, etc.), we keep the host, session, connection, target and scsi
devices around and try to reconnect. A userspace daemon then tries to
reconnect to the target and log in again. If we reconnect within X
seconds (we call this the replacement_timeout, and it is similar to the
FC class dev_loss_tmo), we reuse those structs and go on as normal.
If we do not reconnect within replacement_timeout seconds, we can either
remove the host, session, connection, target and scsi_devices, or keep
them around and reuse them if we later reconnect. If we remove those
structs, we of course have to allocate new ones later and will get a new
host number. Which of the two models we use is controlled in userspace,
and we currently do the wrong thing by default and keep the structs
around.

I guess what we are supposed to do is something similar to the FC class,
where if dev_loss_tmo expires we remove the session, connection, target
and devices. I am not sure if we should be removing the scsi host
though. I think it makes sense to remove that too, since the host and
session are so closely tied in our model.

We are in the process of moving to the model where all the structs are
removed, as the default and only model we support, and it looks like we
will do this in 2.6.19.
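
To make the two models concrete, here is a rough sketch in Python. It is
not kernel or open-iscsi code: apart from replacement_timeout, every
name (ExpiryPolicy, HostAllocator, Session and the three on_* functions)
is made up for illustration, and the "host number" is just a counter
standing in for the scsi_host index.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExpiryPolicy(Enum):
    """Hypothetical userspace setting: what to do once the timeout expires."""
    KEEP = "keep"      # keep host/session/devices and reuse them on reconnect
    REMOVE = "remove"  # tear everything down; a reconnect allocates new structs


class HostAllocator:
    """Hands out increasing host numbers (a stand-in for the SCSI midlayer)."""

    def __init__(self) -> None:
        self._next = 0

    def allocate(self) -> int:
        host_no = self._next
        self._next += 1
        return host_no


@dataclass
class Session:
    host_no: Optional[int]             # None once the structs have been removed
    replacement_timeout: float         # seconds allowed for a reconnect
    policy: ExpiryPolicy
    down_since: Optional[float] = None  # time the connection problem started


def on_connection_error(session: Session, now: float) -> None:
    """Connection problem detected: keep all structs, start the clock."""
    if session.down_since is None:
        session.down_since = now


def on_timer(session: Session, now: float) -> None:
    """Periodic check: apply the expiry policy once the timeout has passed."""
    if session.down_since is None:
        return
    expired = now - session.down_since >= session.replacement_timeout
    if expired and session.policy is ExpiryPolicy.REMOVE:
        session.host_no = None  # host, session, target and devices are gone


def on_reconnect(session: Session, allocator: HostAllocator) -> int:
    """Relogin succeeded: reuse the old host if it survived, else get a new one."""
    if session.host_no is None:
        session.host_no = allocator.allocate()
    session.down_since = None
    return session.host_no
```

With the REMOVE policy, a reconnect inside the window returns the
original host number, while a reconnect after the window returns a new
one, which is the renumbering the fail-over drivers above have trouble
with. With KEEP, the host number never changes.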