On 1/15/21 7:50 PM, David Laight wrote:

> From: Sergei Shtylyov
>> Sent: 15 January 2021 16:40
>>
>> On 1/15/21 7:19 PM, Mathias Nyman wrote:
>>
>>> Once the command ring doorbell is rung, the xHC controller will parse all
>>> command TRBs on the command ring that have the cycle bit set properly.
>>>
>>> If the driver has just started writing the next command TRB to the ring when
>>> the hardware finishes the previous TRB, then the HW might fetch an incomplete
>>> TRB, as long as its cycle bit is set correctly.
>>>
>>> A command TRB is 16 bytes (128 bits) long.
>>> The driver writes the command TRB in four 32-bit chunks, with the chunk
>>> containing the cycle bit last. This does not, however, guarantee that the
>>> chunks actually get written in that order.
>>>
>>> This was detected in stress testing when canceling URBs with several
>>> connected USB devices.
>>> Two consecutive "Set TR Dequeue pointer" commands got queued right
>>> after each other, and the second one was only partially written when
>>> the controller parsed it, causing the dequeue pointer to be set
>>> to bogus values. This was seen as error messages:
>>>
>>> "Mismatch between completed Set TR Deq Ptr command & xHCI internal state"
>>>
>>> The solution is to add a write memory barrier before writing the cycle bit.
>>>
>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>> Tested-by: Ross Zwisler <zwisler@xxxxxxxxxx>
>>> Signed-off-by: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>
>>> ---
>>>  drivers/usb/host/xhci-ring.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
>>> index 5677b81c0915..cf0c93a90200 100644
>>> --- a/drivers/usb/host/xhci-ring.c
>>> +++ b/drivers/usb/host/xhci-ring.c
>>> @@ -2931,6 +2931,8 @@ static void queue_trb(struct xhci_hcd *xhci, struct xhci_ring *ring,
>>>  	trb->field[0] = cpu_to_le32(field1);
>>>  	trb->field[1] = cpu_to_le32(field2);
>>>  	trb->field[2] = cpu_to_le32(field3);
>>> +	/* make sure TRB is fully written before giving it to the controller */
>>> +	wmb();
>>
>>    Have you tried the lighter barrier, dma_wmb()? IIRC, it exists for these exact cases...
>
> Isn't dma_wmb() needed between the last memory write and the io_write to the doorbell?

   No.

> Here we need to ensure the two memory writes aren't re-ordered.

   No, we need all three ring memory writes to be ordered such that they all happen
before the 4th write. It's no wonder this bug hasn't been noticed before -- x86 has
strong write ordering, unlike ARM/etc.

> Apart from alpha isn't a barrier() likely to be enough for that.

   Not sure -- we don't have any barriers before the equivalents of a doorbell write
in e.g. the Renesas Ether driver.

> It is worth checking that the failing compiles didn't have the writes reordered.

   The writes are reordered not because of the compiler -- the read/write reordering
is a CPU feature (on at least non-x86). :-)

> 	David
>
>>
>>>  	trb->field[3] = cpu_to_le32(field4);
>>>
>>>  	trace_xhci_queue_trb(ring, trb);

MBR, Sergei