[RFC] usb, ehci: repeated failover leaves leftover qh_next.ptr

Don Zickus <dzickus@xxxxxxxxxx> · Thu, 31 Mar 2011 17:13:48 -0400

I am working with a customer who has a pair of systems with a single CDROM
that switches between the systems during a failover.  They ran a stress test
and after a few hours and thousands of failovers, they stumbled upon this
BUG() in drivers/usb/host/ehci-mem.c:

static void qh_destroy(struct ehci_qh *qh)
{
        struct ehci_hcd *ehci = qh->ehci;

        /* clean qtds first, and know this is not linked */
        if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
					   ^^^^^^^^^^^^^^^
                ehci_dbg (ehci, "unused qh not empty!\n");
                BUG ();
        }

Their analysis is as follows:

------------------
We believe we've tracked down the root cause of this problem.  It
relates to hot-unplug (surprise disappearance) of an ehci_hcd device.

The problem begins at the top of routine scan_async() in ehci-q.c:

1242         ehci->stamp = ehci_readl(ehci, &ehci->regs->frame_index);

If the ehci has gone away, this ehci_readl() will return ~0U (all
ones), per the PCI spec.

If there are completions to process, the conditional

1249                         if (!list_empty (&qh->qtd_list)
1250                                         && qh->stamp != ehci->stamp)

will be true, so qh->stamp will get the value ~0U, and control will
come back around to the 'rescan:' label.

This time around, presumably, the qtd list is empty, so we'll skip
the first conditional, and begin executing the second one:

1275                         if (list_empty(&qh->qtd_list)
1276                                         && qh->qh_state == QH_STATE_LINKED) {
1277                                 if (!ehci->reclaim
1278                                         && ((ehci->stamp - qh->stamp) & 0x1fff)
1279                                                 >= (EHCI_SHRINK_FRAMES * 8))
1280                                         start_unlink_async(ehci, qh);

Assuming that the other conditions are right, the subtraction of the
two timestamps will evaluate to 0, so the comparison will fail, and
start_unlink_async() will not get called.

Even if this code is executed again later, the two stamps will always
have this same value, so start_unlink_async() will never get invoked on
this qh.

In our case, the qh that's not being unlinked happens to be ehci->async->next.
So, when we later invoke pci removal on the ehci HCD device, we trip
over this, from ehci-mem.c, while trying to destroy ehci->async

 67 static void qh_destroy(struct ehci_qh *qh)
 68 {
 69         struct ehci_hcd *ehci = qh->ehci;
 70
 71         /* clean qtds first, and know this is not linked */
 72         if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
 73                 ehci_dbg (ehci, "unused qh not empty!\n");
 74                 BUG ();
 75         }

Since we still have this "undead" qh hanging off of qh->qh_next,
we take the bugcheck here.
-------------
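
As a sanity check on the arithmetic in the analysis above, here is a minimal
userspace sketch of the age test.  The helper name qh_idle_long_enough() is
hypothetical, and EHCI_SHRINK_FRAMES is assumed to be 5 here; only the masked
subtraction and the comparison are taken from the scan_async() excerpt.

```c
#include <stdbool.h>

/* Assumption: stands in for the kernel's EHCI_SHRINK_FRAMES (taken to be 5). */
#define EHCI_SHRINK_FRAMES 5

/*
 * Hypothetical helper mirroring the age test quoted from scan_async():
 * hc_stamp plays the role of ehci->stamp, qh_stamp the role of qh->stamp.
 */
static bool qh_idle_long_enough(unsigned int hc_stamp, unsigned int qh_stamp)
{
	/* Only the low 13 bits of the difference are compared. */
	return ((hc_stamp - qh_stamp) & 0x1fff) >= (EHCI_SHRINK_FRAMES * 8);
}
```

With both stamps stuck at ~0U the masked difference is 0, which is below the
threshold (40 under the assumption above), so the test cannot pass for that
qh no matter how often scan_async() runs.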

Their solution was to add a check in the code path

1275                         if (list_empty(&qh->qtd_list)
1276                                         && qh->qh_state == QH_STATE_LINKED) {
1277                                 if (!ehci->reclaim
1278                                         && ((ehci->stamp - qh->stamp) & 0x1fff)
1279                                                 >= (EHCI_SHRINK_FRAMES * 8))
1280                                         start_unlink_async(ehci, qh);

to also check for ehci->stamp == ~0U.  It seemed to solve their problem.
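
One detail that may be worth reviewing in the hunk at the end of this mail:
in C, && binds more tightly than ||, so the added test is not covered by the
!ehci->reclaim guard.  The sketch below only illustrates the grouping.  Both
function names are hypothetical, EHCI_SHRINK_FRAMES is assumed to be 5, and
the second variant is just one possible alternative, not a claim about what
the right fix is.

```c
#include <stdbool.h>

/* Assumption: stands in for the kernel's EHCI_SHRINK_FRAMES (taken to be 5). */
#define EHCI_SHRINK_FRAMES 5

/* Grouping of the condition as posted: (a && b) || c. */
static bool unlink_as_posted(bool reclaim_busy, unsigned int hc_stamp,
			     unsigned int qh_stamp)
{
	return (!reclaim_busy
		&& ((hc_stamp - qh_stamp) & 0x1fff) >= (EHCI_SHRINK_FRAMES * 8))
		|| (hc_stamp == ~0U);
}

/* Hypothetical alternative that keeps the reclaim guard: a && (b || c). */
static bool unlink_guarded(bool reclaim_busy, unsigned int hc_stamp,
			   unsigned int qh_stamp)
{
	return !reclaim_busy
		&& (((hc_stamp - qh_stamp) & 0x1fff) >= (EHCI_SHRINK_FRAMES * 8)
		    || (hc_stamp == ~0U));
}
```

The two differ only when a reclaim is already pending and the stamp reads
~0U.  Whether start_unlink_async() is safe to call in that state is part of
what I would like guidance on.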

We aren't sure if this is the correct way to solve the problem and are
looking for guidance from people more knowledgeable.

Thoughts?

Cheers,
Don

--
Note:  this problem was discovered, analyzed, and solved on RHEL-6, which is
closely related to 2.6.32 with regard to the usb2 stack.  The customer
doesn't have the time to use upstream kernels.  The changes in this area
since 2.6.32 seemed minimal enough that I believe the problem is still
relevant.
---
 drivers/usb/host/ehci-q.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 98ded66..5cb649e 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -1287,7 +1287,8 @@ rescan:
 					&& qh->qh_state == QH_STATE_LINKED) {
 				if (!ehci->reclaim
 					&& ((ehci->stamp - qh->stamp) & 0x1fff)
-						>= (EHCI_SHRINK_FRAMES * 8))
+						>= (EHCI_SHRINK_FRAMES * 8) ||
+						(ehci->stamp == ~0U))
 					start_unlink_async(ehci, qh);
 				else
 					action = TIMER_ASYNC_SHRINK;
-- 
1.7.4
