re: [PROBLEM] usb: xhci_bus_resume cause irq lost issue

liudingyuan <liudingyuan@xxxxxxxxxx> · Thu, 6 Mar 2025 14:29:03 +0000

Hi

> > On 5.3.2025 12.07, liudingyuan wrote:
> > 
> > 
> > Hi
> > 
> > I'm running into an issue where the enumeration of a USB2.0 device failed due to lost interrupts.
> > 
> > This issue appears to occur randomly and we can only reproduce it on xHCI controllers that provide both USB3.0 and USB2.0 root hubs. Additionally, it is necessary to ensure that the first-level device under this controller is a SB2.0 device.
> > The above scenario can be referred to in the following figure.
> >     ----------------------------------------------------------------------------
> >    |         +---------------------------------+               |
> >    |         |    xHCI Controller    |               |
> >    |         +---------------------------------+               |
> >    |                /       \                      |
> >    |               /         \                     |
> >    |              /           \                    |
> >    |  +-------------------------+      +---------------------------+   |
> >    |  | USB 3.0 Root Hub |      | USB 2.0 Root Hub  |   |
> >    |  +------------------------+      +----------------------------+   |
> >    --------------|-------------------------------------|-------------------------
> >            |                       |
> >            | [NO device]             | [Device A]
> >            |                       |
> >                               
> >                                 
> > 
> > The USB topology displayed in the OS looks like this:

   /:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 10000M
       ID 1d6b:0003 Linux Foundation 3.0 root hub
> > /:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/ , 480M
> >      ID 1d6b:0002 Linux Foundation 2.0 root hub
> >      |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
> > 
> > /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/ , 5000M
> >      ID 1d6b:0003 Linux Foundation 3.0 root hub

> Odd that the USB3 roothub is registered before USB2 roothub, I thought xhci driver always registers USB2 hcd first .

Yes, this was my mistake. After checking, it turns out that the bus3 and bus4 are two ports belonging to the same xHCI controller.
Information about bus4 has been added as part of the modification.
The failed USB15 in the following log is a USB2 port. The resume of this port has a conflict with the USB3 port USB16.
(In the scene below, this resume was due to the time condition below not being met.)
	/drivers/usb/core/hcd.c
		hcd_bus_suspend() 
		{

			status = hcd->driver->hub_status_data(hcd, buffer)
			drivers/usb/host/xhci-hub.c
			xhci_hub_status_data() 
			{
				if (time_before(jiffies, xhci->run_graceperiod)) 
					status = 1;
			}

			if (status != 0) {
				dev_dbg(&rhdev->dev, "suspend raced with wakeup event\n");
				hcd_bus_resume(rhdev, PMSG_AUTO_RESUME);
				status = -EBUSY;
			}
		}

> > 
> > This issue only occurs when the system reboot or when we insmod the xhci driver or ingister the xhci controller.
> > When the issue occurs, we can observe that the CPU receives fewer interrupts than what would normally be generated during the enumeration process, and there are error logs indicating command timeouts.
> > 	[ 2040.039438]xhci_hcd 0000:8a:00.7: Command timeout, USBSTS: 0x00000018 EINT PCD
> > 	[ 2040.039444] xhci_hcd 0000:8a:00.7: Command timeout
> > 	[ 2040.039446] xhci_hcd 0000:8a:00.7: Abort command ring
> > 	[ 2042.055435] xhci_hcd 0000:8a:00.7: No stop event for abort, ring start fail?
> > 	[ 2042.055469] xhci_hcd 0000:8a:00.7: Timeout while waiting for setup device command
> > 	[ 2042.064048] usb 15-1: hub failed to enable device, error -62
> > 	[ 2054.343446] xhci_hcd 0000:8a:00.7: Unsuccessful disable slot 1 command, status 25
> >   	[ 2066.631449] xhci_hcd 0000:8a:00.7: Error while assigning device slot ID: Command Aborted
> > 	[ 2066.640633] xhci_hcd 0000:8a:00.7: Max number of devices this xHCI host supports is 64.
> > 	[ 2066.649713] usb usb15-port1: couldn't allocate usb_device
> > 
> > After verification, we can confirm that the reason for the interrupt 
> > loss is that during the enumeration of U2 device,
> > U3 port is in a suspend procedure and performs an operation to disable interrupts in this function:
> > 
> > 	drivers/usb/host/xhci-hub.c
> > 		xhci_bus_resume()
> > 			/* delay the irqs */
> > 			temp = readl(&xhci->op_regs->command);
> > 			temp &= ~CMD_EIE;
> > 			writel(temp, &xhci->op_regs->command);
> > 
> > we can temporarily avoid this issue by modifying parameters.
> > echo -1 > /sys/module/usbcore/parameters/autosuspend
> > 
> > I am wondering whether there is a chance of interrupt loss occurring, regardless of whether or not it belongs to the scenario mentioned above? As long as an interrupt from a controller is triggered at exactly the same time as the process of disabling the controller's interrupts?
> > 
> > I read the xHCI protocol version 1.2 and haven't found detailed descriptions for such special cases. So I was wondering what is the main reason for disabling interrupts in xHCI driver during the resume process?
> > 

> This is from a time before I started maintaining the xhci driver. I guess it was done to allow bus suspended usb2 ports to resume fully to U0 before
> xhci_bus_resume() returns.

> Resuming a usb2 port from  U3 suspend to U0 is a two stage process.
> In 'host initiated resume' the xhci driver will first transition the port from 'U3' to 'Resume' state, then wait in Resume state for 20ms, and finally move it to U0 state.

> I assume driver disabled xHC from triggering interrupts to prevent event handler from messing with this usb2 port resume process.

> spin_lock_irqsave(xhci->lock) would normally be used to prevent interrupt handler from interfering, but keeping the spin lock over msleep(20) was not possible.

> The device initiated usb2 resume is much better, it utilizes the event handler, hub thread, timestamps and completions. The same should be done here, but implementation isn't trivial.

> I don't however think there is a reason to turn off interrupts while resuming the USB3 bus. It doesn't sleep so just keeping spin_lock_irqsave() should be enough. This should be an easy fix. Something like this:
> (untested, copy-pasted diff)

Thank you very much for your detailed analysis regarding the reason for disabling interrupts during the resume operation.

I compiled a new kernel based on the fix code you provided below and conducted some preliminary tests. 
In the repeated unregister/register tests of the xHCI controller that previously caused issues,  both the driver and USB-related functionalities are now working normally. 
(Moreover, this fix code, in theory, should completely resolve the issues we encountered in our USB3-USB2 device-only scenario.)

Based on the logic mentioned in analysis, we currently may not have implemented a better solution to avoid disabling interrupts during the USB2 resume process. I would like to ask if we need to
be concerned about the issue of interrupt loss caused by disabling interrupts in other scenarios where resume and enumeration processes or transfer operations might conflict?

For example, when a user inserts a device during the USB2/USB3 port resume, or when the USB3 controller is only connected to a USB3 devices, and the USB2 port enters this resume flow due to auto-suspend?
 (However, it seems that the probability of these two scenarios is very low, as we have not yet been able to reproduce errors under these conditions.)

> diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c index 9693464c0520..7cf7ee84fc96 100644
> --- a/drivers/usb/host/xhci-hub.c
> +++ b/drivers/usb/host/xhci-hub.c
> @@ -1873,6 +1873,7 @@ int xhci_bus_resume(struct usb_hcd *hcd)
>          u32 temp, portsc;
>          struct xhci_hub *rhub;
>          struct xhci_port **ports;
> +       bool disabled_irq = false;
>   
>          rhub = xhci_get_rhub(hcd);
>          ports = rhub->ports;
> @@ -1888,17 +1889,19 @@ int xhci_bus_resume(struct usb_hcd *hcd)
>                  return -ESHUTDOWN;
>          }
>   
> -       /* delay the irqs */
> -       temp = readl(&xhci->op_regs->command);
> -       temp &= ~CMD_EIE;
> -       writel(temp, &xhci->op_regs->command);
> -
>          /* bus specific resume for ports we suspended at bus_suspend */
> -       if (hcd->speed >= HCD_USB3)
> +       if (hcd->speed >= HCD_USB3) {
>                  next_state = XDEV_U0;
> -       else
> +       } else {
>                  next_state = XDEV_RESUME;
> -
> +               if (bus_state->bus_suspended) {
> +                       /* delay the irqs if we need to resume usb2 ports */
> +                       temp = readl(&xhci->op_regs->command);
> +                       temp &= ~CMD_EIE;
> +                       writel(temp, &xhci->op_regs->command);
> +                       disabled_irq = true;
> +               }
> +       }
>          port_index = max_ports;
>          while (port_index--) {
>                  portsc = readl(ports[port_index]->addr); @@ -1967,10 +1970,12 @@ int xhci_bus_resume(struct usb_hcd *hcd)
>   
>         bus_state->next_statechange = jiffies + msecs_to_jiffies(5);
>          /* re-enable irqs */
> -       temp = readl(&xhci->op_regs->command);
> -       temp |= CMD_EIE;
> -       writel(temp, &xhci->op_regs->command);
> -       temp = readl(&xhci->op_regs->command);
> +       if (disabled_irq) {
> +               temp = readl(&xhci->op_regs->command);
> +               temp |= CMD_EIE;
> +               writel(temp, &xhci->op_regs->command);
> +               temp = readl(&xhci->op_regs->command);
> +       }
>   
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          return 0;
>

This fix indeed helps us avoid the current issue, so I would like to ask if it is possible to push this modification as a patch to the mainline code?
If possible, we also plan to conduct a comprehensive test of USB functionality based on this modification to further validate it. 

Considering that the fix cannot completely avoid all possible scenarios where interrupts might be lost due to the hardware IE (Interrupt Enable) being turned off.
I would like to ask whether the hardware design is reasonable in the following case: 
when a hardware edge-triggered interrupt is lost due to IE being disabled, and the subsequent interrupts cannot be triggered because the software didn't
clear the EHB (Event Handler Busy) bit.  

Does XHCI Specificationl have any requirements regarding this? 

                     +--------------------------------+
                     |    xHCI Controller   |
                     |   (Shared Interrupt)  |
                     +--------------------------------+
                          /             \
                         /               \
                        /                 \
           +--------------------------------+   +------------------------------------+
           | USB 2.0 Roothub      |   | USB 3.0 Roothub         |
           | (Suspend in progress)  |   | (Device Enumeration)     |
           |  - IE disabled        |   |  - New device inserted    |
           |  - Events accumulating |   |  - Generates event TRBs  |
           +---------------------------------+   +------------------------------------+
                          \             /
                           \           /
                            \         /
                  +------------------------------------------+
                  | Shared Physical Interrupt Line  |
                  |   (Edge-triggered mechanism) |
                  +-------------------------------------------+
                              ||
                              ||  (No new edge occurs when IE is re-enabled)
                              \/
         [Pending events in the event ring not triggering an interrupt]

Thanks!
Best Regards
Devyn