RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout

Rajesh Bhagat <rajesh.bhagat@xxxxxxx> · Thu, 31 Mar 2016 03:51:16 +0000

> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@xxxxxxxxxxxxxxx]
> Sent: Tuesday, March 29, 2016 10:51 PM
> To: Rajesh Bhagat <rajesh.bhagat@xxxxxxx>
> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; Sriram Dash <sriram.dash@xxxxxxx>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 28.03.2016 09:13, Rajesh Bhagat wrote:
> >
> >
> >> -----Original Message-----
> >> From: Mathias Nyman [mailto:mathias.nyman@xxxxxxxxxxxxxxx]
> >> Sent: Wednesday, March 23, 2016 7:52 PM
> >> To: Rajesh Bhagat <rajesh.bhagat@xxxxxxx>
> >> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; linux-
> >> kernel@xxxxxxxxxxxxxxx; Sriram Dash <sriram.dash@xxxxxxx>
> >> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation
> >> due to XHCI commmand timeout
> >>
> >> On 23.03.2016 05:53, Rajesh Bhagat wrote:
> >>
> >>>>> IMO, The assumption that "xhci_abort_cmd_ring would always
> >>>>> generate an event and handle_cmd_completion would be called" will
> >>>>> not be always be true if HW
> >>>> is in bad state.
> >>>>>
> >>>>> Please share your opinion.
> >>>>>
> >>>>
> >>>> writing the CA (command abort) bit in CRCR (command ring control
> >>>> register)  will stop the command ring, and CRR (command ring
> >>>> running) will be set
> >> to 0 by xHC.
> >>>> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
> >>>> If it's not 0 then the driver considers the command abort as failed.
> >>>>
> >>>> The scenario you're thinking of is that xHC would still react to CA
> >>>> bit set, it would stop the command ring and set CRR 0, but not send
> >>>> a command
> >> completion event.
> >>>>
> >>>> Have you tried adding some debug to handle_cmd_completion() and see
> >>>> if you receive any event after command abortion?
> >>>>
> >>>
> >>> Yes. We have added debug prints at first line of
> >>> handle_cmd_completion, and we are not getting those prints. The last
> >>> print messages that we get are as below from xhci_alloc_dev while
> >>> resume
> >>> operation:
> >>>
> >>> xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> >>> Abort command ring
> >>>
> >>> May be somehow, USB controller is in bad state and not responding to
> >>> the
> >> commands.
> >>>
> >>> Please suggest how XHCI driver can handle such situations.
> >>>
> >>
> >> Restart the command timeout timer when writing the command abort bit.
> >> If we get theIf we get the abort event the timer is deleted.
> >>
> >> Otherwise if the timout triggers a second time we end up calling
> >> xhci_handle_command_timeout() with a stopped ring, This will call
> >> xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op,
> >> restart the command ring, and finally when the no-op completes it
> >> should call the missing completion.
> >>
> >> If command ring doesn't start then additional code could be added to
> >> xhci_handle_command_timeout() that clears the command ring if it is
> >> called a second time (=current command is already in abort state and
> >> command ring is stopped when entering xhci_handle_command_timeout)
> >>
> >> There might be some details missing, I'm not able to test any of
> >> this, but try something like this:
> >>
> >> diff --git a/drivers/usb/host/xhci-ring.c
> >> b/drivers/usb/host/xhci-ring.c index 3e1d24c..576819e 100644
> >> --- a/drivers/usb/host/xhci-ring.c
> >> +++ b/drivers/usb/host/xhci-ring.c
> >> @@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >>                   xhci_halt(xhci);
> >>                   return -ESHUTDOWN;
> >>           }
> >> -
> >> +       /* writing the CMD_RING_ABORT bit should create a command completion
> >> +        * event, add a command completion timeout for it as well
> >> +        */
> >> +       mod_timer(&xhci->cmd_timer, jiffies +
> >> + XHCI_CMD_DEFAULT_TIMEOUT);
> >>           return 0;
> >>    }
> >
> > Hello Mathias,
> >
> > Thanks for the patch.
> >
> > After application of above patch, I'm getting following prints constantly:
> >
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> > ...
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> >
> > As expected, xhci_handle_command_timeout is called again and next time
> > ring state is __not__ CMD_RING_STATE_RUNNING, Hence
> > xhci_handle_stopped_cmd_ring is called which turn all the aborted
> > commands to no-ops and again makes the ring state as
> CMD_RING_STATE_RUNNING, and rings the door bell.
> >
> > But again in this case, no response from USB controller and
> > xhci_alloc_dev is still waiting for wait_for_completion.
> >
> > Does rescheduling the xhci->cmd_timer ensures command completion will
> > be called. IMO, it is still dependent on HW response.
> >
> 
> It's still dependent on hw starting the command ring, which apparently fails
> 
> So additional code is needed that will treat a second or third timeout as a failed abort
> attempt, i.e. if a command times out on a stopped ring and the command status is
> already aborted, then we cleanup the command ring and kill xhci.
> 
> So in addition to the previous added timeout something like this is needed:
> (Again, uncompiled, untested semi pseudocode)
> 
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index
> 7cf6621..d8518e8f 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1253,6 +1253,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          int ret;
>          unsigned long flags;
>          u64 hw_ring_state;
> +       u32 old_status;
>          struct xhci_command *cur_cmd = NULL;
>          xhci = (struct xhci_hcd *) data;
> 
> @@ -1260,6 +1261,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          spin_lock_irqsave(&xhci->lock, flags);
>          if (xhci->current_cmd) {
>                  cur_cmd = xhci->current_cmd;
> +               old_status = cur_cmd->status;
>                  cur_cmd->status = COMP_CMD_ABORT;
>          }
> 
> @@ -1282,7 +1284,14 @@ void xhci_handle_command_timeout(unsigned long data)
>          }
>          /* command timeout on stopped ring, ring can't be aborted */
>          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +       /* is this the second timeout for the same command? ring wont restart */
> +       if (old_status == COMP_CMD_ABORT) {
> +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> +               xhci_cleanup_command_queue(xhci);
> +               /* everything else here to halt, cleanup, free etc kill xhc */
> +       } else {
> +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +       }
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          return;
>   }
> 

Hello Mathias,

Thanks a lot for your support. 

Attached patch is working and allows the completion handler to be called. And resume operation completes 
and shell comes back (but with lot of errors). 

Now, some other points which needs our attention here are: 
1. usb core hub code is not checking the error status of hcd->driver->reset_device(xhci_discover_or_reset_device)
    and continues considering reset_device was successful (and causes other issues). Hence, a check is needed on return 
    value of reset_device and proper action is required on it. 
2. Even if above return value is checked and reset_device status is used. USB core hub retries for N times which is not
    required in this case as adding to the resume operation time. But if in this case we return -ENOTCONN instead of -EINVAL
    from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns early and doesn't retry. 

Proposed Solution for above with your suggested patches is as below:

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 7cad1fa..efeba31 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2807,13 +2807,17 @@ done:
                        struct usb_hcd *hcd = bus_to_hcd(udev->bus);
 
                        update_devnum(udev, 0);
-                       /* The xHC may think the device is already reset,
-                        * so ignore the status.
+                       /*
+                        * check the status of resetting the device and update
+                        * the udev status.
                         */
                        if (hcd->driver->reset_device)
-                               hcd->driver->reset_device(hcd, udev);
+                               status = hcd->driver->reset_device(hcd, udev);
 
-                       usb_set_device_state(udev, USB_STATE_DEFAULT);
+                       if (status == 0)
+                               usb_set_device_state(udev, USB_STATE_DEFAULT);
+                       else
+                               usb_set_device_state(udev, USB_STATE_NOTATTACHED);
                }
        } else {
                if (udev)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b3a0a22..9427175f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
                return -ESHUTDOWN;
        }
 
+       /* writing the CMD_RING_ABORT bit should create a command completion
+        * event, add a command completion timeout for it as well
+        */
+       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
        return 0;
 }
 
@@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
        int ret;
        unsigned long flags;
        u64 hw_ring_state;
+       u32 old_status;
        struct xhci_command *cur_cmd = NULL;
        xhci = (struct xhci_hcd *) data;
 
@@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
        spin_lock_irqsave(&xhci->lock, flags);
        if (xhci->current_cmd) {
                cur_cmd = xhci->current_cmd;
+               old_status = cur_cmd->status;
                cur_cmd->status = COMP_CMD_ABORT;
        }
 
@@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long data)
        }
        /* command timeout on stopped ring, ring can't be aborted */
        xhci_dbg(xhci, "Command timeout on stopped ring\n");
-       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
+       /* is this the second timeout for the same command? ring wont restart */
+       if (old_status == COMP_CMD_ABORT) {
+               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
+               xhci_cleanup_command_queue(xhci);
+       /* everything else here to halt, cleanup, free etc kill xhc */
+       } else
+               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
        spin_unlock_irqrestore(&xhci->lock, flags);
        return;
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index c502c22..bd18966 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd, struct usb_device *udev)
                if (ret == 1)
                        return 0;
                else
-                       return -EINVAL;
+                       return -ENOTCONN;     /* Don't retry */
        }
 
        if (virt_dev->tt_info)

Sample output after above patch (timer is set as wakeup source):

root@phoenix:~# echo mem > /sys/power/state    
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
PM: suspend of devices complete after 155.658 msecs
PM: late suspend of devices complete after 1.594 msecs
PM: noirq suspend of devices complete after 1.496 msecs
PM: noirq resume of devices complete after 1.290 msecs
PM: early resume of devices complete after 1.250 msecs
usb usb1: root hub lost power or was reset
usb usb2: root hub lost power or was reset
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
PM: resume of devices complete after 25436.366 msecs     <= resume time is increased a lot even after using -ENOTCONN
Restarting tasks ... 
usb 1-1: USB disconnect, device number 2
usb 1-1.2: USB disconnect, device number 3
usb 2-1: USB disconnect, device number 2
done.
root@phoenix:~#

Please share your opinion on other changes for patch submission as well as resume time. 

- Rajesh Bhagat 

> -Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html