Re: xhci problem -> general protection fault

Ross Zwisler <zwisler@xxxxxxxxxx> · Tue, 8 Dec 2020 10:24:40 -0700

On Fri, Dec 04, 2020 at 08:07:30PM +0200, Mathias Nyman wrote:
<>
> Ok, thanks.
> 
> Then the root cause remains unknown.
> For some reason the endpoint context dequeue pointer field contains zero
> instead of the new dequeue pointer.
> The (output) endpoint context is supposed to be written only by the controller.
> 
> Time to change strategy and start to detect and treat the symptoms instead.
> 
> I wrote a patch that detects the 0-dequeue pointer and issues a
> new Set TR Deq pointer command. Hopefully that works.
> patch added to same branch, can you try it out?
> 
> 3f6326766abc xhci: retry setting new dequeue if xHC hardware failed to update it
> 
> I didn't set a retry limit yet so if it doesn't work it might retry forever.
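
For illustration, the detect-and-retry idea described above, with a retry cap
added, could look roughly like the sketch below. This is not the code from
commit 3f6326766abc: struct ep_state, check_dequeue(), MAX_DEQ_RETRIES and
DEQ_PTR_MASK are hypothetical names, and the cap of 3 is an arbitrary choice.
The sketch assumes the low four bits of the dequeue field hold the cycle state
and reserved bits, so they are masked off before comparing against zero.

```c
#include <stdint.h>

/* Hypothetical cap; the patch discussed above has no retry limit yet. */
#define MAX_DEQ_RETRIES 3

/* Assumption: bits 3:0 of the dequeue field are cycle state / reserved. */
#define DEQ_PTR_MASK (~(uint64_t)0xf)

/* Hypothetical stand-in for the per-endpoint state a driver would track. */
struct ep_state {
	uint64_t hw_dequeue;   /* value read back from the output endpoint context */
	uint64_t new_dequeue;  /* value requested via Set TR Deq */
	unsigned int retries;  /* Set TR Deq commands re-issued so far */
};

enum deq_action { DEQ_OK, DEQ_RETRY, DEQ_GIVE_UP };

/*
 * Decide what to do after a Set TR Deq command completes:
 * accept a non-zero pointer, retry on a zero pointer, and give up
 * once the retry budget is used so the caller cannot loop forever.
 */
static enum deq_action check_dequeue(struct ep_state *ep)
{
	if ((ep->hw_dequeue & DEQ_PTR_MASK) != 0) {
		ep->retries = 0;
		return DEQ_OK;
	}
	if (ep->retries >= MAX_DEQ_RETRIES)
		return DEQ_GIVE_UP;
	ep->retries++;
	return DEQ_RETRY;
}
```

On DEQ_RETRY the caller would queue another Set TR Deq command with
new_dequeue; on DEQ_GIVE_UP it would fall back to whatever error handling is
appropriate instead of retrying again.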

Here are some logs when running with that commit:

https://gist.github.com/rzwisler/17923c9dedf2b914254eadd1cd294a4c

I think we only consistently get the clean failure case with the dequeue
pointer being 0 if CONFIG_INTEL_IOMMU_DEFAULT_ON=y.

If that option is set to 'n', we get the same failure where the xHCI
controller totally dies (log "CONFIG_INTEL_IOMMU_DEFAULT_ON=n" in the gist).

With CONFIG_INTEL_IOMMU_DEFAULT_ON=y we do seem to live through multiple
errors, but as soon as I try to use the device normally afterwards it seems to
spin forever with these messages:

xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fff0a330 trb-start 00000000f8884000 trb-end 0000000000000000 seg-start 00000000f8884000 seg-end 00000000f8884ff0
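
Taking the numbers in that message at face value, the event DMA address
0xfff0a330 is outside the segment range 0xf8884000..0xf8884ff0, and trb-end is
0. The range check can be written out as a small sketch; dma_in_segment() is a
hypothetical helper for illustration, not the driver's actual lookup routine.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if event_dma falls within [seg_start, seg_end].
 * Hypothetical helper; it only compares the addresses printed in the log.
 */
static bool dma_in_segment(uint64_t event_dma, uint64_t seg_start,
			   uint64_t seg_end)
{
	return event_dma >= seg_start && event_dma <= seg_end;
}
```

With the logged values, dma_in_segment(0xfff0a330, 0xf8884000, 0xf8884ff0)
is false, so the event does not point into that segment at all.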

Are you able to reproduce this with Andrzej's bulk-cancel script? I think you
probably just need a device which accepts bulk transfer commands. In my most
recent reproductions my servo hardware wasn't even attached to a device, so I
don't think it's doing anything except sitting there and receiving BULK_IN
commands. I'm doing this to two devices simultaneously.