On 28.6.2022 10.51, Jörg Rödel wrote:
Hi Mathias,
thanks a lot for your reply.
On Mon, Jun 27, 2022 at 01:52:44PM +0300, Mathias Nyman wrote:
Add to kernel cmdline:
xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M
<boot>
mount -t debugfs none /sys/kernel/debug
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace
Also if you could dump the content of following registers:
cat /sys/kernel/debug/usb/xhci/<pci address>/reg-op
cat /sys/kernel/debug/usb/xhci/<pci address>/reg-runtime
xhci driver writes dma addresses it allocated for the host into
some of those registers
The data you requested is in the attached archive. The trace was
actually empty. My kernel has tracing enabled in general, do I need to
enable more config options to get the trace?
Thanks, not sure about which tracing option is missing.
You might be right about mangling DMA addresses above 32bit.
At least that address looks a lot like the command ring DMA address with a
higher byte zeroed.
Driver allocates and maps memory for the command ring at DMA 0xffffffffffefe000:
[ 6.698051] xhci_hcd 0000:02:00.0: First segment DMA is 0xffffffffffefe000
[ 6.698053] xhci_hcd 0000:02:00.0: // Setting command ring address to 0xffffffffffefe001
That address is stored in a 64 bit CRCR register (mmio at xhci->op_regs->cmd_ring)
by writing low 32 bits first, then high 32 bits.
The AMD-Vi IO_PAGE_FAULT for accessing 0xff00ffffffefe000 comes right after driver
asked xHC to process commands from the command ring (the Ding dong! message)
xHC device reads the commands from the DMA address written to the CRCR register.
The faulty address has the same lower 32 bits as the command ring we allocated.
[ 6.880028] xhci_hcd 0000:02:00.0: // Ding dong!
[ 6.880045] xhci_hcd 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xff00ffffffefe000 flags=0x0000]
Can't find a reason for zeroed bits in the upper 32 bit part.
Looks like we always write all 64 bits to CRCR register.
Maybe a race where command ring is started while writing high 32 bits to CRCR?
Writing address bits to CRCR register are ignored if command ring is running.
Also looks like roothub might be runtime suspending just before the issue.
Anyway, maybe flushing the CRCR register by reading it back after writing it would help:
---
diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 8c19e151a945..e54829898a2c 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -2473,6 +2473,8 @@ int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags)
"// Setting command ring address to 0x%016llx", val_64);
xhci_write_64(xhci, val_64, &xhci->op_regs->cmd_ring);
+ xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
+
/* Reserve one command ring TRB for disabling LPM.
* Since the USB core grabs the shared usb_bus bandwidth mutex before
* disabling LPM, we only need to reserve one TRB for all devices.
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 46d0b9ad6f74..721715ba5085 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -390,6 +390,9 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
crcr = xhci_trb_virt_to_dma(new_seg, new_deq);
xhci_write_64(xhci, crcr | CMD_RING_ABORT, &xhci->op_regs->cmd_ring);
+ xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
+
/* Section 4.6.1.2 of xHCI 1.0 spec says software should also time the
* completion of the Command Abort operation. If CRR is not negated in 5
* seconds then driver handles it as if host died (-ENODEV).
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 9ac56e9ffc64..6180a721693d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -854,6 +854,8 @@ static void xhci_set_cmd_ring_deq(struct xhci_hcd *xhci)
"// Setting command ring address to 0x%llx",
(long unsigned long) val_64);
xhci_write_64(xhci, val_64, &xhci->op_regs->cmd_ring);
+
+ xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
}
/*