On Wed, Oct 21, 2009 at 12:33:25PM -0400, James Smart wrote:
> Here's what I remember about this from the past:
>
> - This was originally added when dealing with older kernels that didn't
>   have the eh patch that bounced the timeout handler when the rport was
>   blocked (see fc_timed_out).
>
>   The eh patch avoided entering the eh thread upon i/o timeouts if the
>   rport was blocked.
>
>
> - As mentioned in my prior email - there's a window where things can be
>   entered before the target blocked state protects you. What if you are in
>   the eh_handler when it occurs ? Unfortunately, the eh thread is very
>   black and white on abort/reset/io status - its either success or not. It
>   doesn't validate the "not" cases, never looks at retry conditions, and
>   just assumes hard failure - which was taking everyone down bad paths.
>   This is a rats nest to resolve right, and I think I mentioned it on the
>   list a long time ago with Christoph. Thus the stall was added to plug the
>   hole.

I could imagine the case where something triggers the SCSI eh thread
(e.g. a command timeout), but as soon as the eh thread starts the
recovery, something else prevents access to the remote port. Returning
DID_IMM_RETRY from fc_remote_port_chkready or the LLD does not help;
scsi_eh_completed_normally simply maps this error code to FAILED.

Holding off the SCSI eh thread until the rport leaves the BLOCKED
state will guarantee that we can either reach the rport or the SCSI
devices are being removed anyway. It looks to me like this is required
to prevent offline SCSI devices in this case.

Should this code be in a FC transport helper function rather than
being duplicated in each FC LLD?

Christof

> Christof Schmitt wrote:
>> On Tue, Oct 20, 2009 at 04:40:27PM +0200, Christof Schmitt wrote:
>>> If the remote_port status is not BLOCKED, this will trigger the SCSI
>>> midlayer error handling which cannot do much during the interruption
>>> to the hardware and will mark the SCSI devices 'offline'.
>>> In order to prevent this, the rule would be: First call
>>> fc_remote_port_delete to set the remote port (or in the case of an HBA
>>> interruption all remote ports) to BLOCKED, and only after this step
>>> call scsi_done to pass the SCSI commands back to the upper layers.
>>
>> I just stumbled across a loop that blocks the SCSI error handling
>> thread:
>>
>>         spin_lock_irqsave(shost->host_lock, flags);
>>         while (rport->port_state == FC_PORTSTATE_BLOCKED) {
>>                 spin_unlock_irqrestore(shost->host_lock, flags);
>>                 msleep(1000);
>>                 spin_lock_irqsave(shost->host_lock, flags);
>>         }
>>         spin_unlock_irqrestore(shost->host_lock, flags);
>>
>> This seems to be popular among FC drivers. Is this the preferred way
>> to synchronize the FC transport class state changes with the SCSI
>> midlayer error recovery?
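
To make the helper question concrete, here is a minimal user-space
sketch of the quoted loop pulled into a single function. It is only an
illustration under stated assumptions, not kernel code: mock_rport,
mock_block_eh, mock_msleep and mock_unblock are hypothetical names, a
pthread mutex stands in for shost->host_lock, and nanosleep stands in
for msleep. The return value (0 if the port came back online, -1
otherwise) is likewise an assumption about what a caller in an eh
handler might want to know.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the FC transport port states. */
enum mock_port_state { MOCK_PORT_ONLINE, MOCK_PORT_BLOCKED, MOCK_PORT_DELETED };

/* Hypothetical stand-in for struct fc_rport plus the host lock. */
struct mock_rport {
	pthread_mutex_t lock;             /* plays the role of shost->host_lock */
	enum mock_port_state port_state;
};

/* Sleep for ms milliseconds (user-space replacement for msleep). */
static void mock_msleep(unsigned int ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (long)(ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/*
 * Hypothetical shared helper: stall the caller while the port is
 * BLOCKED, dropping the lock around each sleep exactly as the quoted
 * loop does. Returns 0 if the port ended up online, -1 otherwise.
 */
static int mock_block_eh(struct mock_rport *rport, unsigned int poll_ms)
{
	enum mock_port_state state;

	pthread_mutex_lock(&rport->lock);
	while (rport->port_state == MOCK_PORT_BLOCKED) {
		pthread_mutex_unlock(&rport->lock);
		mock_msleep(poll_ms);
		pthread_mutex_lock(&rport->lock);
	}
	state = rport->port_state;
	pthread_mutex_unlock(&rport->lock);

	return state == MOCK_PORT_ONLINE ? 0 : -1;
}

/* Thread body that simulates the port coming back after a short delay. */
static void *mock_unblock(void *arg)
{
	struct mock_rport *rport = arg;

	mock_msleep(20);
	pthread_mutex_lock(&rport->lock);
	rport->port_state = MOCK_PORT_ONLINE;
	pthread_mutex_unlock(&rport->lock);
	return NULL;
}
```

In this sketch each eh callback in an LLD would call the one helper at
its start instead of open-coding the loop, for example by starting
mock_unblock in a second thread and calling mock_block_eh(&rport, 5)
from the first.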