Re: [PATCH 12/14] ib_srp: Rework error handling

On Mon, 2011-12-19 at 05:38 -0500, Bart Van Assche wrote:
> On Mon, Dec 19, 2011 at 3:36 AM, David Dillow <dillowda@xxxxxxxx> wrote:
> > On Thu, 2011-12-01 at 20:10 +0100, Bart Van Assche wrote:
> >> Rework ib_srp transport layer error handling. Instead of letting SCSI
> >> commands time out if a transport layer error occurs,
> >
> > This is good, but should probably be part of the initial disconnect. We
> > want to run the receive queue dry, processing responses to avoid
> > unnecessarily resetting a command that was successful.
> 
> Blocking the SCSI host doesn't prevent ib_srp from continuing to
> process IB completion notifications.

In your series, it doesn't -- and I was wrong about the message spam;
you check for the proper state and for the work already being queued.

I haven't parsed it all out from your changes just yet, but I think part
of the reason you may have had problems with req->scmnd being NULL in
srp_handle_recv() is a new race between tearing down the connection and
continuing to process completion notifications.
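
To make the ordering concrete, here's roughly what I have in mind -- a
sketch only, using the recv_cq and target->lock layout from the current
driver; srp_drain_recv_cq() and srp_finish_req() are names I just made
up:

static void srp_drain_recv_cq(struct srp_target_port *target)
{
        struct ib_wc wc;

        /*
         * Run the receive queue dry first, so responses that already
         * arrived complete their commands normally instead of being
         * reset later.
         */
        while (ib_poll_cq(target->recv_cq, 1, &wc) > 0)
                if (wc.status == IB_WC_SUCCESS)
                        srp_handle_recv(target, &wc);
}

Then the teardown path can claim each request under target->lock, so a
late srp_handle_recv() either sees the response first or finds the
request already gone -- never a half-freed one:

static void srp_finish_req(struct srp_target_port *target,
                           struct srp_request *req)
{
        struct scsi_cmnd *scmnd;
        unsigned long flags;

        /* Claim the request; only one side gets a non-NULL scmnd. */
        spin_lock_irqsave(&target->lock, flags);
        scmnd = req->scmnd;
        req->scmnd = NULL;
        spin_unlock_irqrestore(&target->lock, flags);

        if (scmnd) {
                scmnd->result = DID_RESET << 16;
                scmnd->scsi_done(scmnd);
        }
}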

> >> block the SCSI
> >> host and try to reconnect until the reconnect timeout elapses or until
> >> the maximum number of reconnect attempts has been exceeded, whichever
> >> happens first.
> >
> > When we're blocked for a disconnected target, we may want to call the
> > state something other than SRP_TARGET_BLOCKED. I'd like to eventually
> > handle the case where the target leaves the fabric temporarily -- we
> > don't necessarily disconnect in that case, but we need to block commands
> > until it comes back or we give up on it.
> 
> The case where the target leaves the fabric temporarily should already
> be covered by this patch series.

Only once a command is sent to it and times out, or we get a QP error.
The idea here would be to start the dev_loss_tmo timer as soon as we
detect that the target has left the fabric. This was handled in OFED
via sysfs attributes, but it seems like we could register for the
events ourselves.

This isn't something I expect you to implement, but I'd like to leave
room for it in the future.
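
For reference, registering for the async port events could look
something like this -- just a sketch against the verbs API;
srp_port_event() and the event_handler field are invented:

static void srp_port_event(struct ib_event_handler *handler,
                           struct ib_event *event)
{
        switch (event->event) {
        case IB_EVENT_PORT_ERR:
                /* Port went away: start dev_loss_tmo here. */
                break;
        case IB_EVENT_PORT_ACTIVE:
                /* Port came back: cancel the timer, unblock the host. */
                break;
        default:
                break;
        }
}

        /* At host setup time: */
        INIT_IB_EVENT_HANDLER(&host->event_handler, device,
                              srp_port_event);
        ib_register_event_handler(&host->event_handler);

That would give us a hook to start the timer without waiting for a
command to fail.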

> >> +static void srp_remove_target(struct srp_target_port *target)
> >>  {
> >
> >>       srp_del_scsi_host_attr(target->scsi_host);
> >> +     cancel_work_sync(&target->block_work);
> >> +     mutex_lock(&target->mutex);
> >> +     mutex_unlock(&target->mutex);
> >
> > You lock and unlock here without doing anything in the critical section.
> 
> That's on purpose.

I'm assuming you do this to ensure that everyone has seen the
appropriate state and exited the critical section, then? Best to add a
comment to that effect. Actually, is there any reason to unlock it again
since we're removing the target? I don't have the code in front of me
ATM, so I'm assuming that you don't call any other routines that need
the lock after this. sparse may complain about the apparently
unbalanced locking, though.
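
Concretely, something like this is what I'd hope to read there (the
comment wording is mine, everything else as in your patch):

        srp_del_scsi_host_attr(target->scsi_host);
        cancel_work_sync(&target->block_work);
        /*
         * Taking and immediately dropping target->mutex is a barrier:
         * once we own the lock, no task that saw the old target state
         * can still be inside the critical section, so tearing the
         * target down below is safe.
         */
        mutex_lock(&target->mutex);
        mutex_unlock(&target->mutex);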

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
