RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to unrecoverable loop.

Leo Li <leoyang.li@xxxxxxx> · Mon, 29 Nov 2021 23:37:17 +0000

> -----Original Message-----
> From: Eugene Bordenkircher <Eugene_Bordenkircher@xxxxxxxxxx>
> Sent: Monday, November 29, 2021 11:25 AM
> To: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>; jocke@xxxxxxxxxxxx
> <joakim.tjernlund@xxxxxxxxxxxx>; linuxppc-dev@xxxxxxxxxxxxxxxx; linux-
> usb@xxxxxxxxxxxxxxx
> Cc: Leo Li <leoyang.li@xxxxxxx>; gregkh@xxxxxxxxxxxxxxxxxxx;
> balbi@xxxxxxxxxx
> Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to
> unrecoverable loop.
> 
> The final result of our testing is that the patch set posted seems to address all
> known defects in the Linux kernel.  The mentioned additional problems are
> entirely caused by the antivirus solution on the windows box.  The antivirus
> solution blocks the disconnect messages from reaching the RNDIS driver so it
> has no idea the USB device went away.  There is nothing we can do to
> address this in the Linux kernel.

Thanks for the confirmation.

> 
> I propose we move forward with the patchset.

I think that we should proceed to merge the patchset but it seems to need some cleanup for coding style issues and better description before submitted formally.

> 
> Eugene T. Bordenkircher
> 
> -----Original Message-----
> From: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
> Sent: Thursday, November 25, 2021 5:59 AM
> To: Eugene Bordenkircher <Eugene_Bordenkircher@xxxxxxxxxx>; Thorsten
> Leemhuis <regressions@xxxxxxxxxxxxx>; Joakim Tjernlund
> <Joakim.Tjernlund@xxxxxxxxxxxx>; linuxppc-dev@xxxxxxxxxxxxxxxx; linux-
> usb@xxxxxxxxxxxxxxx
> Cc: leoyang.li@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; balbi@xxxxxxxxxx
> Subject: Re: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to
> unrecoverable loop.
> 
> Hi, this is your Linux kernel regression tracker speaking.
> 
> Top-posting for once, to make this easy to process for everyone:
> 
> Li Yang and Felipe Balbi: how to move on with this? It's quite an old
> regression, but nevertheless it is one and thus should be fixed. Part of my
> position is to make that happen and thus remind developers and maintainers
> about this until the regression is resolved.
> 
> Ciao, Thorsten
> 
> On 16.11.21 20:11, Eugene Bordenkircher wrote:
> > On 02.11.21 22:15, Joakim Tjernlund wrote:
> >> On Sat, 2021-10-30 at 14:20 +0000, Joakim Tjernlund wrote:
> >>> On Fri, 2021-10-29 at 17:14 +0000, Eugene Bordenkircher wrote:
> >>
> >>>> We've discovered a situation where the FSL udc driver
> (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating over the
> request queue, but the queue has been corrupted at some point so it loops
> infinitely.  I believe we have narrowed into the offending code, but we are in
> need of assistance trying to find an appropriate fix for the problem.  The
> identified code appears to be in all versions of the Linux kernel the driver
> exists in.
> >>>>
> >>>> The problem appears to be when handling a USB_REQ_GET_STATUS
> request.  The driver gets this request and then calls the ch9getstatus()
> function.  In this function, it starts a request by "borrowing" the per device
> status_req, filling it in, and then queuing it with a call to list_add_tail() to add
> the request to the endpoint queue.  Right before it exits the function
> however, it's calling ep0_prime_status(), which is filling out that same
> status_req structure and then queuing it with another call to list_add_tail() to
> add the request to the endpoint queue.  This adds two instances of the exact
> same LIST_HEAD to the endpoint queue, which breaks the list since the prev
> and next pointers end up pointing to the wrong things.  This ends up causing
> a hard loop the next time nuke() gets called, which happens on the next
> setup IRQ.
> >>>>
> >>>> I'm not sure what the appropriate fix to this problem is, mostly due to
> my lack of expertise in USB and this driver stack.  The code has been this way
> in the kernel for a very long time, which suggests that it has been working,
> unless USB_REQ_GET_STATUS requests are never made.  This further
> suggests that there is something else going on that I don't understand.
> Deleting the call to ep0_prime_status() and the following ep0stall() call
> appears, on the surface, to get the device working again, but may have side
> effects that I'm not seeing.
> >>>>
> >>>> I'm hopeful someone in the community can help provide some
> information on what I may be missing or help come up with a solution to the
> problem.  A big thank you to anyone who would like to help out.
> >>>
> >>> Run into this to a while ago. Found the bug and a few more fixes.
> >>> This is against 4.19 so you may have to tweak them a bit.
> >>> Feel free to upstream them.
> >>
> >> Curious, did my patches help? Good to known once we upgrade as well.
> >
> > There's good news and bad news.
> >
> > The good news is that this appears to stop the driver from entering an
> > infinite loop, which prevents the Linux system from locking up and
> > never recovering.  So I'm willing to say we've made the behavior
> > better.
> >
> > The bad news is that once we get past this point, there is new bad
> > behavior.  What is on top of this driver in our system is the RNDIS
> > gadget driver communicating to a Laptop running Win10 -1809.
> > Everything appears to work fine with the Linux system until there is a
> > USB disconnect.  After the disconnect, the Linux side appears to
> > continue on just fine, but the Windows side doesn't seem to recognize
> > the disconnect, which causes the USB driver on that side to hang
> > forever and eventually blue screen the box.  This doesn't happen on
> > all machines, just a select few.   I think we can isolate the
> > behavior to a specific antivirus/security software driver that is
> > inserting itself into the USB stack and filtering the disconnect
> > message, but we're still proving that.
> >
> > I'm about 90% certain this is a different problem and we can call this
> > patchset good, at least for our test setup.  My only hesitation is if
> > the Linux side is sending a set of responses that are confusing the
> > Windows side (specifically this antivirus) or not.  I'd be content
> > calling that a separate defect though and letting this one close up
> > with that patchset.
> 
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports on my
> table. I can only look briefly into most of them. Unfortunately therefore I
> sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to tell me
> about it in a public reply. That's in everyone's interest, as what I wrote above
> might be misleading to everyone reading this; any suggestion I gave they
> thus might sent someone reading this down the wrong rabbit hole, which
> none of us wants.
> 
> BTW, I have no personal interest in this issue, which is tracked using regzbot,
> my Linux kernel regression tracking bot
> (https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Furld
> efense.com%2Fv3%2F__https%3A%2F%2Flinux-
> regtracking.leemhuis.info%2Fregzbot%2F__%3B!!O7uE89YCNVw!aHa5_mLM
> nBeDjINlAtV19tBHm-
> He9jbusXucMA5h7oonHvNFwYpOHAaaqqewPOuGK9HAzJUz%24&amp;data
> =04%7C01%7Cleoyang.li%40nxp.com%7C859ce1560a7344729cea08d9b35d2e
> 67%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C6377380350721308
> 84%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM
> zIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ONQZyAKXNgok
> 6LgYvnaAL7LVY%2B5Wl7pXglZDqWUJZMc%3D&amp;reserved=0 ). I'm only
> posting this mail to get things rolling again and hence don't need to be CC on
> all further activities wrt to this regression.
> 
> #regzbot title: usb: fsl_udc_core: corrupted request list leads to
> unrecoverable loop