Re: Reproducible deadlock when usb-storage scsi command timeouts twice

Benjamin Block <bblock@xxxxxxxxxxxxx> · Wed, 3 May 2023 12:51:37 +0000

On Wed, May 03, 2023 at 12:55:03PM +0200, Oliver Neukum wrote:
> On 03.05.23 12:24, Benjamin Block wrote:
> > On Wed, Apr 26, 2023 at 03:20:07PM -0400, Alan Stern wrote:
> 
> >  From a cursory look at the logs above, SCSI ML does just try that:
> > 
> >>> [  218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled
> >>> [  218.109297] sd 0:0:0:0: [sda] tag#0 aborting command
> > 
> > calls `hostt->eh_abort_handler()` on the late request, and retries it
> > after success.
> > 
> >>> [  218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command
> >>> [  225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed
> > 
> > but it times out again, then we go straight into EH:
> 
> And that is problematic to usb-storage
> > 
> >>> [  225.129337] scsi host0: Waking error handler thread
> >>> [  225.129358] scsi host0: scsi_eh_0: waking up 0/1/1
> >>> [  225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
> >>> [  225.129387] scsi host0: Total of 1 commands on 1 devices require eh work
> >>> [  225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR
> > 
> > IIRC in the past we used to call Abort a second time from within the EH
> > thread before trying the different resets, but that was removed at some
> > point a couple of years ago. 

Seems like I misremembered. I can't find the commit I was thinking of,
and the only thing that changed was that aborts moved out of the EH
thread and became asynchronous.

> > Now we add the command straight to the EH
> > list, and start with the TMF LUN reset, which ought to implicitly abort
> > the command as well on the target.
> 
> usb-storage can do a reset only on the USB device level,
> which translates to a bus reset on the SCSI level.
> 
> And we are supposed to cancel any communication with the device
> before that.

Is that a limitation of the devices or of the drivers? Because then you
don't match SCSI semantics for LU reset, which, among other things,
aborts all running commands in that scope. That might explain the
reasoning behind this behavior that is unexpected for you.

One random thought I had: in theory you could implement your own EH
strategy handler if the default one doesn't work for you. ATA and SAS do so.
[drivers/scsi/scsi_error.c:2285 `shost->transportt->eh_strategy_handler()`]
This can re-use parts/all of the existing escalation sequence in
`scsi_eh_ready_devs()`.

But that's no short-term fix.
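
To illustrate the dispatch idea only, here is a rough userspace model in C.
It is not kernel code: every `mock_*` name is hypothetical, the default
escalation is reduced to a single LUN reset step, and the "cancel first,
then reset the device" ordering in the custom handler is my assumption
about what usb-storage would want, based on the constraint quoted above.

```c
/* Userspace model only: all mock_* names are hypothetical, not kernel API. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_MAX_STEPS 8

struct mock_host {
	/* Optional override, modelled on transportt->eh_strategy_handler. */
	void (*eh_strategy_handler)(struct mock_host *host);
	const char *steps[MOCK_MAX_STEPS];	/* recovery steps, in order */
	size_t nsteps;
};

/* Append one recovery step to the host's log. */
static void mock_record(struct mock_host *host, const char *step)
{
	assert(host->nsteps < MOCK_MAX_STEPS);
	host->steps[host->nsteps++] = step;
}

/* Stand-in for the default escalation, reduced to its first step. */
static void mock_default_escalation(struct mock_host *host)
{
	mock_record(host, "lun-reset");
}

/* Assumed custom strategy: cancel outstanding transfers, then reset. */
static void mock_usb_strategy(struct mock_host *host)
{
	mock_record(host, "cancel-transfers");
	mock_record(host, "device-reset");
}

/* Prefer the override if one is set, else run the default escalation. */
static void mock_run_eh(struct mock_host *host)
{
	host->nsteps = 0;
	if (host->eh_strategy_handler)
		host->eh_strategy_handler(host);
	else
		mock_default_escalation(host);
}
```

With no handler set, `mock_run_eh()` records only the LUN reset; with
`mock_usb_strategy` installed, it records the cancel step followed by the
device reset. A real handler would still have to deal with the EH command
lists and finishing the commands, which this model leaves out entirely.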

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Gregor Pillen         /         Geschäftsführung: David Faller
Sitz der Ges.: Böblingen     /    Registergericht: AmtsG Stuttgart, HRB 243294