On Tue, 2015-05-05 at 11:21 +0200, Bart Van Assche wrote:
> On 04/30/15 18:08, Doug Ledford wrote:
> > On Thu, 2015-04-30 at 10:58 +0200, Bart Van Assche wrote:
> >> @@ -2367,7 +2368,7 @@ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
> >>  	case IB_CM_DREQ_RECEIVED:
> >>  		shost_printk(KERN_WARNING, target->scsi_host,
> >>  			     PFX "DREQ received - connection closed\n");
> >> -		srp_change_conn_state(target, false);
> >> +		ch->connected = false;
> >
> > So, in this patch, you modify disconnect to set srp_change_conn_state()
> > to false for the target, then loop through all the channels sending
> > cm_dreq's, and on the receiving side, you modify the cm_dreq handler to
> > set each channel to false.  However, once you get to 0 channels open,
> > shouldn't you then set the target state to false too just to keep things
> > consistent?
>
> Hello Doug,
>
> What is not visible in this patch but only in the ib_srp.c source code
> is that the first received DREQ initiates a reconnect (the queue_work()
> call below):
>
> 	case IB_CM_DREQ_RECEIVED:
> 		shost_printk(KERN_WARNING, target->scsi_host,
> 			     PFX "DREQ received - connection closed\n");
> 		ch->connected = false;
> 		if (ib_send_cm_drep(cm_id, NULL, 0))
> 			shost_printk(KERN_ERR, target->scsi_host,
> 				     PFX "Sending CM DREP failed\n");
> 		queue_work(system_long_wq, &target->tl_err_work);
> 		break;
>
> That should be sufficient to restore communication after a DREQ has
> been received.

Sure, but there is no guarantee that the wq is not busy with something
else, or that the reconnect attempt will succeed.  So it seems to me
that if you want your internal driver state to stay consistent, you
should set the device-level connected state to false once there are no
connected channels left.

However, while looking through the driver to research this, I noticed
something else that seems more important if you ask me.  With this
patch we now track connection state per channel.  However, in
srp_queuecommand() you pick the channel based on the blk_mq tag, and
the blk layer has no idea about these disconnects, so it is free to
hand out a tag that maps to a channel that's disconnected.  As best I
can tell, we would then simply try to post a work request on a channel
that's already disconnected, which I would expect to fail if we have
already torn down that particular QP and not brought up a new one yet.
So it seems to me there is a race between new incoming SCSI commands
and this disconnect/reconnect window, and that maybe we should be
sending those commands back to the mid layer for requeueing whenever
the channel the blk_mq tag points to is disconnected.  Or am I missing
something in there?
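Something along these lines is roughly what I have in mind, purely as a
rough, untested sketch against the multichannel layout of this series
(srp_any_channel_connected() is a made-up helper; host_to_target(), the
blk_mq tag helpers and SCSI_MLQUEUE_HOST_BUSY are the stock interfaces):

/* Hypothetical helper, not in the patch: true if any channel is still up. */
static bool srp_any_channel_connected(struct srp_target_port *target)
{
	int i;

	for (i = 0; i < target->ch_count; i++)
		if (target->ch[i].connected)
			return true;
	return false;
}

static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
{
	struct srp_target_port *target = host_to_target(shost);
	u32 tag = blk_mq_unique_tag(scmnd->request);
	struct srp_rdma_ch *ch = &target->ch[blk_mq_unique_tag_to_hwq(tag)];

	if (!ch->connected)
		/* Channel is inside the disconnect/reconnect window; let the
		 * SCSI mid layer requeue instead of posting on a dead QP. */
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... existing mapping and ib_post_send() path unchanged ... */
	return 0;
}

The same helper could also be called from the IB_CM_DREQ_RECEIVED case
so that srp_change_conn_state(target, false) runs once the last channel
drops, which would cover the consistency point above as well.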
-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: 0E572FDD