James Smart wrote:
Pre-emptively terminate i/o on the rport if dev_loss_tmo has fired.
The desire is to terminate everything so that the i/o is cleaned up
prior to the sdevs being unblocked; this avoids any outstanding
timeouts/aborts.
Also, we do this early enough that the rport's port_id field is
still valid. The FCoE libFC code needs this info to find the i/o's to
terminate.
-- james s
Signed-off-by: James Smart <james.smart@xxxxxxxxxx>
---
scsi_transport_fc.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff -upNr a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
--- a/drivers/scsi/scsi_transport_fc.c 2008-10-18 10:32:52.000000000 -0400
+++ b/drivers/scsi/scsi_transport_fc.c 2008-12-05 12:13:54.000000000 -0500
@@ -3012,6 +3012,16 @@ fc_timeout_deleted_rport(struct work_str
rport->port_state = FC_PORTSTATE_NOTPRESENT;
rport->flags &= ~FC_RPORT_FAST_FAIL_TIMEDOUT;
+ /*
+ * Pre-emptively kill I/O rather than waiting for the work queue
+ * item to teardown the starget. (FCOE libFC folks prefer this
+ * and to have the rport_port_id still set when it's done).
+ */
+ spin_unlock_irqrestore(shost->host_lock, flags);
+ fc_terminate_rport_io(rport);
+
+ BUG_ON(rport->port_state != FC_PORTSTATE_NOTPRESENT);
+
I think we could hit this BUG_ON, or we might hit a problem below where
this thread is changing the port_id while some other thread is calling
fc_remote_port_add and changing the port id.
I think the problem is that fc_remote_port_add only calls fc_flush_work,
which flushes the fc_host work_q:
struct fc_rport *
fc_remote_port_add(struct Scsi_Host *shost, int channel,
struct fc_rport_identifiers *ids)
{
struct fc_internal *fci = to_fc_internal(shost->transportt);
struct fc_host_attrs *fc_host = shost_to_fc_host(shost);
struct fc_rport *rport;
unsigned long flags;
int match = 0;
/* ensure any stgt delete functions are done */
fc_flush_work(shost);
But fc_timeout_deleted_rport's dev_loss_work is running on the devloss_work_q.
So if fc_timeout_deleted_rport grabs the host lock first, it will set
the port state to FC_PORTSTATE_NOTPRESENT, then drop the lock.
Once fc_timeout_deleted_rport drops the lock, fc_remote_port_add could
have already passed the fc_flush_work() call because there was nothing
on that queue. fc_remote_port_add would then drop down to the list search
on the bindings array and see the remote port. It would then start setting
the port state, id, port_name and node_name, but at the same time,
because the host lock no longer guards them, fc_timeout_deleted_rport
could be fiddling with the same fields, and we could end up with a mix of
values, or it could trip the BUG_ON.
Is that right?
If so, do we just want to flush the devloss_work_q in fc_remote_port_add?
/* remove the identifiers that aren't used in the consisting binding */
switch (fc_host->tgtid_bind_type) {
case FC_TGTID_BIND_BY_WWPN:
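If flushing the devloss work queue is the right fix, the change might
look something like the sketch below. This assumes fc_flush_devloss()
(the helper in scsi_transport_fc.c that waits for fc_host->devloss_work_q
to drain) is safe to call at this point; the placement is only a
suggestion, not a tested patch:

	struct fc_rport *
	fc_remote_port_add(struct Scsi_Host *shost, int channel,
			   struct fc_rport_identifiers *ids)
	{
		...
		/* ensure any stgt delete functions are done */
		fc_flush_work(shost);

		/*
		 * Sketch: also wait for any fc_timeout_deleted_rport work
		 * still running on the devloss_work_q, so it cannot race
		 * with the port_state/port_id/port_name/node_name updates
		 * done below once the host lock is dropped.
		 */
		fc_flush_devloss(shost);
		...

That would make fc_remote_port_add wait out any in-flight dev_loss_tmo
callback before it touches the same rport fields.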
@@ -3035,7 +3045,6 @@ fc_timeout_deleted_rport(struct work_str
* went away and didn't come back - we'll remove
* all attached scsi devices.
*/
- spin_unlock_irqrestore(shost->host_lock, flags);
scsi_target_unblock(&rport->dev);
fc_queue_work(shost, &rport->stgt_delete_work);
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html