Re: Hitting "unused qh not empty" BUG in qh_destroy

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 15 Sep 2014 11:59:15 -0400 (EDT)

On Sat, 13 Sep 2014, Joe Lawrence wrote:

> Hi Alan,
> 
> I've collected 16 crashes since kicking off automated tests a little
> over 24 hrs ago.
> 
> Each crash hit the BUG in qh_destroy() and the only unique debugging
> printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0".

What about error messages from the check_async_ring routine?

> I can include (or upload) a full (or filtered) vmcore-dmesg.txt if that
> would be more helpful.

Only if it includes one of those messages.

> The debug code I added as you suggested is provided below...
> 
> Thanks,
> 
> -- Joe
> 
> 
> diff -Nupr before/drivers/usb/host/ehci.h after/drivers/usb/host/ehci.h
> --- before/drivers/usb/host/ehci.h	2014-07-16 14:25:31.000000000 -0400
> +++ after/drivers/usb/host/ehci.h	2014-09-12 15:47:18.943478397 -0400
> @@ -231,6 +231,8 @@ struct ehci_hcd {			/* one per controlle
>  
>  	/* platform-specific data -- must come last */
>  	unsigned long		priv[0] __aligned(sizeof(s64));
> +
> +	int num_async;
>  };

Not a good idea to put your new field _after_ a field that is
documented as coming last, but in this case it probably won't hurt.

>  /* convert between an HCD pointer and the corresponding EHCI_HCD */
> diff -Nupr before/drivers/usb/host/ehci-hcd.c after/drivers/usb/host/ehci-hcd.c
> --- before/drivers/usb/host/ehci-hcd.c	2014-07-16 14:25:31.000000000 -0400
> +++ after/drivers/usb/host/ehci-hcd.c	2014-09-12 15:53:16.863163627 -0400
> @@ -440,6 +440,8 @@ static void ehci_stop (struct usb_hcd *h
>  	if (ehci->amd_pll_fix == 1)
>  		usb_amd_dev_put();
>  
> +	pr_err("%s: ehci->num_async = %d\n", __func__, ehci->num_async);
> +
>  	dbg_status (ehci, "ehci_stop completed",
>  		    ehci_readl(ehci, &ehci->regs->status));
>  }

Always 0, as it should be.

> diff -Nupr before/drivers/usb/host/ehci-q.c after/drivers/usb/host/ehci-q.c
> --- before/drivers/usb/host/ehci-q.c	2014-07-16 14:25:31.000000000 -0400
> +++ after/drivers/usb/host/ehci-q.c	2014-09-12 15:52:08.023292291 -0400
> @@ -959,6 +959,24 @@ static void disable_async(struct ehci_hc
>  	ehci_poll_ASS(ehci);
>  }
>  
> +static void check_async_ring(struct ehci_hcd *ehci, int add)
> +{
> +	struct ehci_qh *qh;
> +	int n;
> +
> +	qh = ehci->async->qh_next.qh;
> +	n = ehci->num_async += add;
> +	while (qh && n > 0) {
> +		qh = qh->qh_next.qh;
> +		--n;
> +	}
> +	if (qh || n != 0) {
> +		ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n",
> +				ehci->num_async, n, qh);
> +		BUG();
> +	}
> +}

In the last call to this function, ehci->num_async must get set to 0 
(because that is the final value as printed out in ehci_stop).  This 
means n = 0, which means ehci->async->qh_next.qh must be NULL.  And 
this routine is the only place where ehci->num_async gets changed.

But then ehci->async->qh_next.qh isn't NULL when the final qh_destroy 
call is made.  This suggests that something changes it in the meantime.

You can check for this.  Sprinkle ehci_info messages throughout 
ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
NULL the entire time.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html