Recovering from transaction errors [was: Re: [syzbot] INFO: rcu detected stall in tx]

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 20 May 2021 17:05:06 -0400

On Thu, May 20, 2021 at 08:30:01PM +0000, Thinh Nguyen wrote:
> Hm... looks like we have a couple of issues in the uas storage class
> driver and the xhci driver.
> 
> We may need to fix that in the uas storage driver because it doesn't
> seem to handle it. (check uas_data_cmplt() in uas.c).

Hmmm.  I see that if there is an error, the driver assumes no data was 
transferred.  It shouldn't make that assumption; it should always use 
urb->actual_length.

It also doesn't indicate to the SCSI layer when a command gets an error 
and it doesn't try to do any recovery.  Of course, the SCSI error 
handler may initiate some recovery actions.

> As for the xhci driver, there maybe a case where the stream URB never
> gets to complete because the transaction err_count is not properly
> updated. The err_count for transaction error is stored in ep_ring, but
> the xhci driver may not be able to lookup the correct ep_ring based on
> TRB address for streams. There are cases for streams where the event
> TRBs have their TRB pointer field cleared to '0' (xhci spec section
> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> it automatically does a soft-retry. This is seen from one of our
> testings that the driver was repeatedly doing soft-retry until the class
> driver timed out.
> 
> Hi Mathias, maybe you have some comment on this? Thanks.
> 
> > 
> > The fact is that only a small percentage of -EPROTO errors are 
> > recoverable.  Some of them can be handled by a port reset, which can be 
> > pretty awkward to perform but does occasionally work.  A lot of them 
> > occur because the USB cable has been unplugged; obviously there's no way 
> > to recover from that.  With only a few exceptions, the best and simplest 
> > approach is not to try to recover at all.
> 
> If the cable is unplugged, then we should get a connection change event
> and the driver can handle it properly.

Yes -- unless the driver is in such a tight retry loop that the rest of 
the system never gets a chance to process the connection change event.  
I've seen bug reports where that happened.

> Yes, it's probably simplest to do a port reset and let the transfer be
> incomplete/corrupted. However, I think we should give
> ClearFeature(ep_halt) some more thoughts as I think it can be a recovery
> mechanism for storage class driver, even though that it may not be
> foolproof.

When you say storage class driver, which one are you talking about,
usb-storage or uas?  usb-storage already has a pretty robust recovery 
mechanism.

> > For the case in question (the syzbot bug report that started this 
> > thread), the class driver doesn't try to perform any recovery.  It just 
> > resubmits the URB, getting into a tight retry loop which consumes too 
> > much CPU time.  Simply giving up would be preferable.
> > 
> > Alan Stern
> > 
> 
> I see. By giving up, you mean doing port reset right? Otherwise it needs
> some other mechanism to synchronize with the device side.

No, I mean the driver should just stop communicating with the device.  
That's an appropriate action for lots of drivers.  If the user wants to 
re-synchronize with the device, he can unplug the USB cable and plug it 
back in again.

Alan Stern