Re: [PATCH 2/7] usb: xhci: Check endpoint is valid before dereferencing it

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Tue, 17 Jan 2023 12:02:59 +0200

On 16.1.2023 18.59, Ladislav Michl wrote:
Hi Mathias,

On Mon, Jan 16, 2023 at 04:22:11PM +0200, Mathias Nyman wrote:
From: Jimmy Hu <hhhuuu@xxxxxxxxxx>

When the host controller is not responding, all URBs queued to all
endpoints need to be killed. This can cause a kernel panic if we
dereference an invalid endpoint.

Fix this by using xhci_get_virt_ep() helper to find the endpoint and
checking if the endpoint is valid before dereferencing it.

I'm a bit confused this goes in and even to stable. Let me quote your
own analysis from
Message-ID: <0fe978ed-8269-9774-1c40-f8a98c17e838@xxxxxxxxxxxxxxx>
On Thu, Dec 22, 2022 at 03:18:53PM +0200, Mathias Nyman wrote:
I think root cause is that freeing xhci->devs[i] and including rings isn't
protected by the lock, this happens in xhci_free_virt_device() called by
xhci_free_dev(), which in turn may be called by usbcore at any time

So xhci->devs[i] might just suddenly disappear

Patch just checks more often if xhci->devs[i] is valid, between every endpoint.
So the race between xhci_free_virt_device() and xhci_kill_endpoint_urbs()
doesn't trigger null pointer deref as easily.> 
I believe the above is correct and even Jimmy was unable to verify your
later patch (3rd in this serie), which brings a question how could be this
patch tested. It just burns a bug a bit deeper and I do not think it is the
right approach.

As I said in a direct response to the original patch I think this is a valid fix
for older kernels where we used to unlock xhci->lock while giving back
URBs. Together with PATCH 3/7 the issue should be completely resolved.
For later kernels PATCH 3/7 should be enough by itself, but no harm in keeping this.

See Message-ID: <379b395f-b65c-96fe-7ecc-f18e3740b990@xxxxxxxxxxxxxxx>

Older kernels are all before v5.5 that lack commit
36dc01657b49 usb: host: xhci: Support running urb giveback in tasklet context.

I haven't been able to trigger this issue myself, but based on the report and finding in
the code I still think this the right approach. The internal testing this has been
through could only prove these patches (2/7 and 3/7) don't cause any additional issues.

If you think the analysis or solution is incorrect let me know, and we can come up with a
better one.

Thanks
Mathias