On Tue, 2015-05-05 at 11:21 +0200, Bart Van Assche wrote:
> On 04/30/15 18:08, Doug Ledford wrote:
> > On Thu, 2015-04-30 at 10:58 +0200, Bart Van Assche wrote:
> >> @@ -2367,7 +2368,7 @@ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
> >>  	case IB_CM_DREQ_RECEIVED:
> >>  		shost_printk(KERN_WARNING, target->scsi_host,
> >>  			     PFX "DREQ received - connection closed\n");
> >> -		srp_change_conn_state(target, false);
> >> +		ch->connected = false;
> >
> > So, in this patch, you modify disconnect to set srp_change_conn_state()
> > to false for the target, then loop through all the channels sending
> > cm_dreq's, and on the receiving side, you modify the cm_dreq handler to
> > set each channel to false.  However, once you get to 0 channels open,
> > shouldn't you then set the target state to false too just to keep things
> > consistent?
>
> Hello Doug,
>
> What is not visible in this patch but only in the ib_srp.c source code
> is that the first received DREQ initiates a reconnect (the queue_work()
> call below):
>
> 	case IB_CM_DREQ_RECEIVED:
> 		shost_printk(KERN_WARNING, target->scsi_host,
> 			     PFX "DREQ received - connection closed\n");
> 		ch->connected = false;
> 		if (ib_send_cm_drep(cm_id, NULL, 0))
> 			shost_printk(KERN_ERR, target->scsi_host,
> 				     PFX "Sending CM DREP failed\n");
> 		queue_work(system_long_wq, &target->tl_err_work);
> 		break;
>
> That should be sufficient to restore communication after a DREQ has
> been received.

Sure, but there is no guarantee that the wq is not busy with something
else, or that the reconnect attempt will succeed.  So it seems to me
that if you want your internal driver state to stay consistent, you
should set the device-level connected state to false once there are no
connected channels left.

However, while looking through the driver to research this, I noticed
something else that seems more important if you ask me.  With this
patch we now track connection state per channel.  However, in
srp_queuecommand() you pick the channel based on the blk_mq tag, and
the blk layer has no idea about these disconnects, so it is free to
hand out a tag that maps to a channel that's disconnected.  As best I
can tell, we would then simply try to post a work request on a channel
that's already disconnected, which I would expect to fail if we have
already torn down that particular QP and not brought up a new one yet.
So it seems to me there is a race between new incoming SCSI commands
and this disconnect/reconnect window, and that maybe we should be
sending those commands back to the mid layer for requeueing whenever
the channel the blk_mq tag points to is disconnected.  Or am I missing
something in there?
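Something along these lines is roughly what I have in mind, purely as a
rough, untested sketch against the multichannel layout of this series
(srp_any_channel_connected() is a made-up helper; host_to_target(), the
blk_mq tag helpers and SCSI_MLQUEUE_HOST_BUSY are the stock interfaces):

/* Hypothetical helper, not in the patch: true if any channel is still up. */
static bool srp_any_channel_connected(struct srp_target_port *target)
{
	int i;

	for (i = 0; i < target->ch_count; i++)
		if (target->ch[i].connected)
			return true;
	return false;
}

static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
{
	struct srp_target_port *target = host_to_target(shost);
	u32 tag = blk_mq_unique_tag(scmnd->request);
	struct srp_rdma_ch *ch = &target->ch[blk_mq_unique_tag_to_hwq(tag)];

	if (!ch->connected)
		/* Channel is inside the disconnect/reconnect window; let the
		 * SCSI mid layer requeue instead of posting on a dead QP. */
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... existing mapping and ib_post_send() path unchanged ... */
	return 0;
}

The same helper could also be called from the IB_CM_DREQ_RECEIVED case
so that srp_change_conn_state(target, false) runs once the last channel
drops, which would cover the consistency point above as well.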
-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: 0E572FDD