On Thu, Jan 07, 2021 at 10:57:19AM +0200, Mathias Nyman wrote: > On 6.1.2021 20.52, Ross Zwisler wrote: > > On Wed, Dec 30, 2020 at 5:31 AM Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> wrote: > > <> > >> I was able to reproduce the issue with two USB devices on a different system. > >> > >> I saw that the new incorrect dequeue pointer sometimes had the higher 32 bits > >> incorrect while the lower bits were correct. > >> So this looks like a memory access order issue. > >> > >> The command TRB is 16 bytes. The xhci driver writes it in four 4 byte chunks. > >> Even if the driver writes the chunk that sets the cycle bit last, handing the TRB > >> over to the controller, it appears that the actual write order can be different. > >> The controller ends up reading a command TRB with updated cycle bit but old bogus > >> values in the "new dequeue pointer" field. > >> > >> Adding a write memory barrier before writing the cycle bit solved the issue in my case. > >> > >> Whole patch series is updated, added write memory barrier, and rebased on 5.10. > >> It can be found force-updated in the same rewrite_halt_stop_handling branch: > >> > >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git rewrite_halt_stop_handling > >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=rewrite_halt_stop_handling > >> > >> Does this work for you? > > > > Yes, it does!! I verified that I'm able to reproduce the issue in less than > > 10 seconds with this commit which is the HEAD~1 of your series: > > > > a7d053d207121 xhci: handle halting transfer event properly after endpoint stop and halt raced. > > > > With this commit (the final commit in your series, adding a single patch): > > > > 96887d191a88c xhci: make sure TRB is fully written before giving it to the controller > > > > I ran cleanly for over 20 minutes and haven't been able to reproduce the > > issue. It looks like the wmb() added in that patch makes all the difference! > > > > Thank you for the fix, and you can add this tag to the series: > > > > Tested-by: Ross Zwisler <zwisler@xxxxxxxxxx> > > > > Great, thanks for testing. > > I think it makes sense to cherry pick that one-liner wmb() patch first to 5.11 and stable, > then add rest of the rewrite later. > > -Mathias Makes sense to me. FWIW I've verified that v5.10 reproduces the issue easily, and v5.10 with just the wmb() patch does not. Thanks again for the fix!