> -----Original Message----- > From: Eugene Bordenkircher <Eugene_Bordenkircher@xxxxxxxxxx> > Sent: Monday, November 29, 2021 11:25 AM > To: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>; jocke@xxxxxxxxxxxx > <joakim.tjernlund@xxxxxxxxxxxx>; linuxppc-dev@xxxxxxxxxxxxxxxx; linux- > usb@xxxxxxxxxxxxxxx > Cc: Leo Li <leoyang.li@xxxxxxx>; gregkh@xxxxxxxxxxxxxxxxxxx; > balbi@xxxxxxxxxx > Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to > unrecoverable loop. > > The final result of our testing is that the patch set posted seems to address all > known defects in the Linux kernel. The mentioned additional problems are > entirely caused by the antivirus solution on the windows box. The antivirus > solution blocks the disconnect messages from reaching the RNDIS driver so it > has no idea the USB device went away. There is nothing we can do to > address this in the Linux kernel. Thanks for the confirmation. > > I propose we move forward with the patchset. I think that we should proceed to merge the patchset but it seems to need some cleanup for coding style issues and better description before submitted formally. > > Eugene T. Bordenkircher > > -----Original Message----- > From: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> > Sent: Thursday, November 25, 2021 5:59 AM > To: Eugene Bordenkircher <Eugene_Bordenkircher@xxxxxxxxxx>; Thorsten > Leemhuis <regressions@xxxxxxxxxxxxx>; Joakim Tjernlund > <Joakim.Tjernlund@xxxxxxxxxxxx>; linuxppc-dev@xxxxxxxxxxxxxxxx; linux- > usb@xxxxxxxxxxxxxxx > Cc: leoyang.li@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; balbi@xxxxxxxxxx > Subject: Re: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to > unrecoverable loop. > > Hi, this is your Linux kernel regression tracker speaking. > > Top-posting for once, to make this easy to process for everyone: > > Li Yang and Felipe Balbi: how to move on with this? It's quite an old > regression, but nevertheless it is one and thus should be fixed. Part of my > position is to make that happen and thus remind developers and maintainers > about this until the regression is resolved. > > Ciao, Thorsten > > On 16.11.21 20:11, Eugene Bordenkircher wrote: > > On 02.11.21 22:15, Joakim Tjernlund wrote: > >> On Sat, 2021-10-30 at 14:20 +0000, Joakim Tjernlund wrote: > >>> On Fri, 2021-10-29 at 17:14 +0000, Eugene Bordenkircher wrote: > >> > >>>> We've discovered a situation where the FSL udc driver > (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating over the > request queue, but the queue has been corrupted at some point so it loops > infinitely. I believe we have narrowed into the offending code, but we are in > need of assistance trying to find an appropriate fix for the problem. The > identified code appears to be in all versions of the Linux kernel the driver > exists in. > >>>> > >>>> The problem appears to be when handling a USB_REQ_GET_STATUS > request. The driver gets this request and then calls the ch9getstatus() > function. In this function, it starts a request by "borrowing" the per device > status_req, filling it in, and then queuing it with a call to list_add_tail() to add > the request to the endpoint queue. Right before it exits the function > however, it's calling ep0_prime_status(), which is filling out that same > status_req structure and then queuing it with another call to list_add_tail() to > add the request to the endpoint queue. This adds two instances of the exact > same LIST_HEAD to the endpoint queue, which breaks the list since the prev > and next pointers end up pointing to the wrong things. This ends up causing > a hard loop the next time nuke() gets called, which happens on the next > setup IRQ. > >>>> > >>>> I'm not sure what the appropriate fix to this problem is, mostly due to > my lack of expertise in USB and this driver stack. The code has been this way > in the kernel for a very long time, which suggests that it has been working, > unless USB_REQ_GET_STATUS requests are never made. This further > suggests that there is something else going on that I don't understand. > Deleting the call to ep0_prime_status() and the following ep0stall() call > appears, on the surface, to get the device working again, but may have side > effects that I'm not seeing. > >>>> > >>>> I'm hopeful someone in the community can help provide some > information on what I may be missing or help come up with a solution to the > problem. A big thank you to anyone who would like to help out. > >>> > >>> Run into this to a while ago. Found the bug and a few more fixes. > >>> This is against 4.19 so you may have to tweak them a bit. > >>> Feel free to upstream them. > >> > >> Curious, did my patches help? Good to known once we upgrade as well. > > > > There's good news and bad news. > > > > The good news is that this appears to stop the driver from entering an > > infinite loop, which prevents the Linux system from locking up and > > never recovering. So I'm willing to say we've made the behavior > > better. > > > > The bad news is that once we get past this point, there is new bad > > behavior. What is on top of this driver in our system is the RNDIS > > gadget driver communicating to a Laptop running Win10 -1809. > > Everything appears to work fine with the Linux system until there is a > > USB disconnect. After the disconnect, the Linux side appears to > > continue on just fine, but the Windows side doesn't seem to recognize > > the disconnect, which causes the USB driver on that side to hang > > forever and eventually blue screen the box. This doesn't happen on > > all machines, just a select few. I think we can isolate the > > behavior to a specific antivirus/security software driver that is > > inserting itself into the USB stack and filtering the disconnect > > message, but we're still proving that. > > > > I'm about 90% certain this is a different problem and we can call this > > patchset good, at least for our test setup. My only hesitation is if > > the Linux side is sending a set of responses that are confusing the > > Windows side (specifically this antivirus) or not. I'd be content > > calling that a separate defect though and letting this one close up > > with that patchset. > > P.S.: As a Linux kernel regression tracker I'm getting a lot of reports on my > table. I can only look briefly into most of them. Unfortunately therefore I > sometimes will get things wrong or miss something important. > I hope that's not the case here; if you think it is, don't hesitate to tell me > about it in a public reply. That's in everyone's interest, as what I wrote above > might be misleading to everyone reading this; any suggestion I gave they > thus might sent someone reading this down the wrong rabbit hole, which > none of us wants. > > BTW, I have no personal interest in this issue, which is tracked using regzbot, > my Linux kernel regression tracking bot > (https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Furld > efense.com%2Fv3%2F__https%3A%2F%2Flinux- > regtracking.leemhuis.info%2Fregzbot%2F__%3B!!O7uE89YCNVw!aHa5_mLM > nBeDjINlAtV19tBHm- > He9jbusXucMA5h7oonHvNFwYpOHAaaqqewPOuGK9HAzJUz%24&data > =04%7C01%7Cleoyang.li%40nxp.com%7C859ce1560a7344729cea08d9b35d2e > 67%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C6377380350721308 > 84%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM > zIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ONQZyAKXNgok > 6LgYvnaAL7LVY%2B5Wl7pXglZDqWUJZMc%3D&reserved=0 ). I'm only > posting this mail to get things rolling again and hence don't need to be CC on > all further activities wrt to this regression. > > #regzbot title: usb: fsl_udc_core: corrupted request list leads to > unrecoverable loop