Re: Deadlock in usb-storage error handling

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 20 Mar 2014 13:26:59 -0700

On Thu, 2014-03-20 at 15:48 -0400, Alan Stern wrote:
> On Thu, 20 Mar 2014, James Bottomley wrote:
> 
> > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote:
> > > On Thu, 20 Mar 2014, James Bottomley wrote:
> > > 
> > > > OK, so I think we have three things to do
> > > > 
> > > >      1. Investigate SCSI and fix it's abort state problem that's causing
> > > >         it not to send the abort second time around
> > > >      2. Fix usb-storage to fail a reset it can't do (i.e. device reset
> > > >         with outstanding commands)
> > > >      3. Find out why we're sending a spurious request sense.
> > > > 
> > > > I can look at 1 and 3 if you want to take 2.
> > > 
> > > It's a deal!  Thanks for your help.
> > 
> > And this looks to be 3: a bug in the way we attach sense data to
> > commands (we shouldn't look for attached sense if the device error code
> > didn't imply there would be any).
> > 
> > James
> > 
> > ---
> > 
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 771c16b..d020149 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q,
> >  					     __func__));
> >  			break;
> >  		}
> > +		if (status_byte(scmd->result) != CHECK_CONDITION)
> > +			/*
> > +			 * don't request sense if there's no check condition
> > +			 * status because the error we're processing isn't one
> > +			 * that has a sense code (and some devices get
> > +			 * confused by sense requests out of the blue)
> > +			 */
> > +			continue;
> > +
> >  		SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd,
> >  						  "%s: requesting sense\n",
> >  						  current->comm));
> 
> I tried this patch first, because fixing the earlier bug would mask
> this one.
> 
> The patch sort of worked.  But the first time I tried it, it failed in
> a rather amusing way.  While the second retry was running and hung,
> scmd->result _was_ equal to CHECK_CONDITION -- because that was the
> result from the _first_ retry, and it had never gotten cleared!
> 
> scmd->result needs to be set to 0 before the queuecommand callback is
> invoked.  I ended up adding this to your patch, and then it worked
> perfectly:

Wow, the stale data bugs are just crawling out of the code.  Thanks for
checking.

> 
> Index: usb-3.14/drivers/scsi/scsi_error.c
> ===================================================================
> --- usb-3.14.orig/drivers/scsi/scsi_error.c
> +++ usb-3.14/drivers/scsi/scsi_error.c
> @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd
>  	memset(scmd->cmnd, 0, BLK_MAX_CDB);
>  	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
>  	scmd->request->next_rq = NULL;
> +	scmd->result = 0;
>  
>  	if (sense_bytes) {
>  		scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE,
> Index: usb-3.14/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-3.14.orig/drivers/scsi/scsi_lib.c
> +++ usb-3.14/drivers/scsi/scsi_lib.c
> @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s
>  	 * lock such that the kblockd_schedule_work() call happens
>  	 * before blk_cleanup_queue() finishes.
>  	 */
> +	cmd->result = 0;
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	blk_requeue_request(q, cmd->request);
>  	kblockd_schedule_work(q, &device->requeue_work);
> 
> 
> Maybe only the second one is necessary, but it seemed best to be
> consistent.

Thanks, I'll add this one to the list as well and see if we can get it
into the merge window.  I take it you'd like a cc to stable on these
three?

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html