Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 10 Apr 2014 13:36:11 -0700

On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
> On 04/10/2014 05:31 PM, Alan Stern wrote:
> > On Thu, 10 Apr 2014, Hannes Reinecke wrote:
> >
> >> On 04/10/2014 12:58 PM, Andreas Reis wrote:
> >>> That patch appears to work in preventing the crashes, judged on one
> >>> repeated appearance of the bug.
> >>>
> >>> dmesg had the usual
> >>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> >>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> >>> xhci_hcd
> >>> [  215.350296] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep ffff880427b829c0
> >>> [  215.350305] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep ffff880427b82a08
> >>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> >>>
> >>> repeated five times, followed by one
> >>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> >>> recovery
> >>>
> >>> and then as often as something tried to read from it:
> >>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> >>>
> >>> The stick could then be properly un- and remounted (the latter if it
> >>> had been physically replugged) without issue -- for the bug to
> >>> reoccur after one to three minutes. I tried this three times, no
> >>> dmesg difference except the ep addresses varied on two of that.
> >>>
> >> Was this just that patch you've tested with or the entire patch series?
> >>
> >> If the latter, Alan, is this the expected outcome?
> >
> > Yes, it is.  The same thing should happen with the entire patch series.
> >
> >> I would've thought the error recovery should _not_ run into
> >> offlining devices here, but rather the device should be recovered
> >> eventually.
> >
> > The command times out, it is aborted, and the command is retried.  The
> > same thing happens, and we repeat five times.  Eventually the SCSI core
> > gives up and declares the device to be offline.
> >
> Hmm. Ok. If you are fine with it who am I to argue here.
> James, shall I resend the patch series?

You mean the one patch?  No, it's OK, I have it.

It's still not complete, though, as I've said a couple of times.  The
problem is that we have abort memory on any eh command as well, which
this doesn't fix.

The scenario is abort command, set flag, abort completes, send TUR, TUR
doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
will skip the abort because the flag is set and move straight to reset.

The fix is this, I can just add it as well.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..7516e2c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
 	ses->prot_op = scmd->prot_op;
 
 	scmd->prot_op = SCSI_PROT_NORMAL;
+	scmd->eh_eflags = 0;
 	scmd->cmnd = ses->eh_cmnd;
 	memset(scmd->cmnd, 0, BLK_MAX_CDB);
 	memset(&scmd->sdb, 0, sizeof(scmd->sdb));
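
For illustration only, the scenario described above (abort, flag set, TUR
sent on the same command, TUR times out) can be modelled in user space.
This is a minimal sketch, not kernel code: struct model_cmnd,
MODEL_EH_ABORT_DONE and the model_* functions are hypothetical stand-ins.
Only the eh_eflags field name and the idea of clearing it when the command
is prepared for error handling come from the patch.

```c
/* Hypothetical stand-in for the "abort already attempted" bit in eh_eflags. */
#define MODEL_EH_ABORT_DONE 0x01u

enum model_action { MODEL_ABORT, MODEL_RESET };

/* Reduced model of a command: only the error-handling flags matter here. */
struct model_cmnd {
	unsigned int eh_eflags;
};

/*
 * Models the escalation decision: if the flag is already set, the abort
 * is skipped and recovery moves straight to reset.
 */
static enum model_action model_abort_eh_cmnd(struct model_cmnd *cmd)
{
	if (cmd->eh_eflags & MODEL_EH_ABORT_DONE)
		return MODEL_RESET;
	cmd->eh_eflags |= MODEL_EH_ABORT_DONE;
	return MODEL_ABORT;
}

/*
 * Models reusing the command for an EH request such as TUR.
 * clear_flags != 0 mirrors the one-line fix (eh_eflags = 0).
 */
static void model_eh_prep_cmnd(struct model_cmnd *cmd, int clear_flags)
{
	if (clear_flags)
		cmd->eh_eflags = 0;
}

/* Plays the scenario and reports what happens when the TUR times out. */
static enum model_action model_tur_timeout(int with_fix)
{
	struct model_cmnd cmd = { 0 };

	model_abort_eh_cmnd(&cmd);          /* original command aborted, flag set */
	model_eh_prep_cmnd(&cmd, with_fix); /* same command reused for the TUR */
	return model_abort_eh_cmnd(&cmd);   /* TUR did not return */
}
```

In this model, model_tur_timeout(0) escalates to a reset because the stale
flag survives into the TUR, while model_tur_timeout(1) gets a normal abort
attempt first.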

