On Thu, 2011-12-01 at 20:10 +0100, Bart Van Assche wrote:
> Add the necessary functions in the SRP transport module to allow
> an SRP initiator driver to implement transport layer recovery.
>
> Convert srp_target_port.qp_in_error into a target port state.

This part should probably be folded into patch 7.

> Factor out the code for removing an SRP target into the new function
> srp_remove_target().

This could be a separate patch, making it easy to review.

> In the ib_srp IB completion handlers, do not stop processing
> completions after the first error completion or a CM DREQ has been
> received.

Why do this? It seems like you're just going to create a flurry of
error messages when one would suffice, and we've got to tear down and
reconnect anyway.

> Block the SCSI target as soon as the first IB error
> completion has been received. Eliminate the SCSI target state test
> from srp_queuecommand().

I think you still want to check the SRP target state in
srp_queuecommand() -- you schedule the blocking of the queue, but it
could be a while before it happens. There's no sense in sending down
more commands when you know they are going to fail -- and if you're
worried about the performance of the conditional in the hot path, we
can make it unlikely().

Cleaning up the target states lets us just check for SRP_TARGET_LIVE,
though you would want to change the creation sequence so that isn't
the birth state of the target.

> Rework ib_srp transport layer error handling. Instead of letting SCSI
> commands time out if a transport layer error occurs,

This is good, but it should probably be part of the initial
disconnect. We want to run the receive queue dry, processing
responses, to avoid unnecessarily resetting a command that was
actually successful.

> block the SCSI
> host and try to reconnect until the reconnect timeout elapses or until
> the maximum number of reconnect attempts has been exceeded, whichever
> happens first.
When we're blocked for a disconnected target, we may want to call the
state something other than SRP_TARGET_BLOCKED. I'd like to eventually
handle the case where the target leaves the fabric temporarily -- we
don't necessarily disconnect in that case, but we need to block
commands until it comes back or we give up on it.

> /*
>  * Copyright (c) 2005 Cisco Systems. All rights reserved.
> + * Copyright (c) 2010-2011 Bart Van Assche <bvanassche@xxxxxxx>.

You've tried to add this in the past; I don't mean to deny you credit,
and I certainly value your work, but I'm not sure where the threshold
is for adding a copyright line. Roland, care to comment here?

> +static void srp_remove_target(struct srp_target_port *target)
> {
> 	srp_del_scsi_host_attr(target->scsi_host);
> +	cancel_work_sync(&target->block_work);
> +	mutex_lock(&target->mutex);
> +	mutex_unlock(&target->mutex);

You lock and unlock here without doing anything in the critical
section.

> static int srp_reset_host(struct scsi_cmnd *scmnd)
> {
> 	struct srp_target_port *target = host_to_target(scmnd->device->host);
> +	struct srp_host *host = target->srp_host;
> 	int ret = FAILED;
> +	bool remove = false;
>
> 	shost_printk(KERN_ERR, target->scsi_host, PFX "SRP reset_host called\n");
>
> -	if (!srp_reconnect_target(target))
> +	if (!srp_reconnect_target(target)) {
> 		ret = SUCCESS;
> +	} else {
> +		/*
> +		 * We couldn't reconnect, so kill our target port off.
> +		 * However, we have to defer the real removal because we
> +		 * are in the context of the SCSI error handler now, which
> +		 * will deadlock if we call scsi_remove_host().

Part of me wonders if killing the SCSI host on a failed reconnect is
the correct thing to do. Certainly, the checks in
scsi_eh_test_devices() are going to kick the drives to offline state,
but the FC transport blocks the error handling to avoid that. I think
we need to give our error behavior a bit more thought.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office