> -----Original Message-----
> From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
> Sent: Wednesday, December 8, 2021 2:41 PM
> To: Chugh, Sanjeev <Sanjeev_Chugh@xxxxxxxxxx>
> Cc: linux-usb@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH V1 1/1] usb: hub: introduce delay after usb hub
> mutex release
>
> On Wed, Dec 08, 2021 at 07:26:08AM +0000, Chugh, Sanjeev wrote:
> >
> > > -----Original Message-----
> > > From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
> > > Sent: Wednesday, December 8, 2021 12:34 PM
> > > To: Chugh, Sanjeev <Sanjeev_Chugh@xxxxxxxxxx>
> > > Cc: linux-usb@xxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH V1 1/1] usb: hub: introduce delay after usb
> > > hub mutex release
> > >
> > > On Wed, Dec 08, 2021 at 06:58:23AM +0000, Chugh, Sanjeev wrote:
> > > > Hi Greg,
> > > >
> > > > > -----Original Message-----
> > > > > From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
> > > > > Sent: Tuesday, December 7, 2021 8:46 PM
> > > > > To: Chugh, Sanjeev <Sanjeev_Chugh@xxxxxxxxxx>
> > > > > Cc: linux-usb@xxxxxxxxxxxxxxx
> > > > > Subject: Re: [PATCH V1 1/1] usb: hub: introduce delay after
> > > > > usb hub mutex release
> > > > >
> > > > > On Tue, Dec 07, 2021 at 05:54:06AM +0000, Chugh, Sanjeev wrote:
> > > > > > Hello Greg,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
> > > > > > > Sent: Friday, December 3, 2021 6:22 PM
> > > > > > > To: Chugh, Sanjeev <Sanjeev_Chugh@xxxxxxxxxx>
> > > > > > > Cc: linux-usb@xxxxxxxxxxxxxxx
> > > > > > > Subject: Re: [PATCH V1 1/1] usb: hub: introduce delay
> > > > > > > after usb hub mutex release
> > > > > > >
> > > > > > > On Wed, Dec 01, 2021 at 06:28:08AM +0530, Sanjeev Chugh wrote:
> > > > > > > > Rogue usb sticks can cause endless port connect change
> > > > > > > > events due to an unstable electrical connection between
> > > > > > > > the usb chip and the port. A port connect change causes
> > > > > > > > device enumeration for the new connection, and new
> > > > > > > > device processing is done with the usb hub mutex
> > > > > > > > acquired. This can leave only a very short time gap
> > > > > > > > between unlock/lock of the hub device mutex, so if some
> > > > > > > > thread is running at a lower priority than hub_thread,
> > > > > > > > it can face hub device mutex starvation.
> > > > > > > >
> > > > > > > > This commit introduces a sleep of 25ms after the hub
> > > > > > > > device mutex is unlocked in hub_events so that if
> > > > > > > > hub_thread is stuck in an endless loop, all other
> > > > > > > > threads will get a fair chance to acquire the usb hub
> > > > > > > > mutex.
> > > > > > > >
> > > > > > > > Signed-off-by: Sanjeev Chugh <sanjeev_chugh@xxxxxxxxxx>
> > > > > > > > ---
> > > > > > > >  drivers/usb/core/hub.c | 11 +++++++++++
> > > > > > > >  1 file changed, 11 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> > > > > > > > index 00070a8..0be2acc 100644
> > > > > > > > --- a/drivers/usb/core/hub.c
> > > > > > > > +++ b/drivers/usb/core/hub.c
> > > > > > > > @@ -5763,6 +5763,17 @@ static void hub_event(struct work_struct *work)
> > > > > > > >  out_hdev_lock:
> > > > > > > >  	usb_unlock_device(hdev);
> > > > > > > >
> > > > > > > > +	/*
> > > > > > > > +	 * Rogue usb sticks can cause endless device connection
> > > > > > > > +	 * events due to unstable electric connection.
> > > > > > >
> > > > > > > What is a "usb stick"?  Any usb device could cause this,
> > > > > > > right?
> > > > > > In my test case, I used a usb pen drive of SanDisk make,
> > > > > > with 32 GB capacity. But yes, any faulty usb device, or even
> > > > > > a faulty usb cable used for the usb device connection, can
> > > > > > cause this issue.
> > > > > >
> > > > > > > > +	 * This can cause very short time gap between unlock/lock
> > > > > > > > +	 * of hub device mutex thus causing mutex starvation
> > > > > > > > +	 * for some other lower priority thread. Thus sleep
> > > > > > > > +	 * would give fair chance to all other threads to
> > > > > > > > +	 * acquire the usb hub mutex.
> > > > > > > > +	 */
> > > > > > > > +	msleep(25);
> > > > > > >
> > > > > > > What normal code path did you just slow down here?  Why
> > > > > > > not slow down the disconnect path instead of the connect
> > > > > > > path?
> > > > > >
> > > > > > In case of a faulty usb device, USB device enumeration
> > > > > > doesn't succeed at all and comes out of hub_port_init()
> > > > > > through a failure path. There is no device disconnection
> > > > > > path in this flow. Therefore, this delay is added after
> > > > > > hub_port_connect_change() is done trying to handle the port
> > > > > > connect change event.
> > > > >
> > > > > So you are going to slow down all valid devices just to handle
> > > > > the fact that an abusive device can cause the system to slow
> > > > > down?
> > > > >
> > > > > What exactly is the problem if you plug in an abusive device,
> > > > > that you are trying to solve?
> > > >
> > > > My apologies for not explaining the impact of the failure in the
> > > > first place.
> > > >
> > > > Our end product is a car radio. If a faulty usb device (pen
> > > > drive) is connected to the radio system, an unstable electrical
> > > > connection between the usb device and the usb hub port can lead
> > > > to endless port connect change events being notified to the EHCI
> > > > controller. Thus hub_event() is forced to process all the
> > > > connection change events, with the hub device mutex acquired at
> > > > the beginning of the function and released only after the event
> > > > is processed. On every attempt, hub_port_init() fails with usb
> > > > commands timing out. E.g.:
> > > >
> > > > [  226.854368] usb 1-1.3: new high-speed USB device number 7 using ci_hdrc
> > > > [  231.934052] usb 1-1.3: device descriptor read/64, error -110
> > > > [  247.124276] usb 1-1.3: device descriptor read/64, error -110
> > > > [  247.494572] usb 1-1.3: new high-speed USB device number 9 using ci_hdrc
> > > > [  252.574226] usb 1-1.3: device descriptor read/64, error -110
> > > > [  267.764276] usb 1-1.3: device descriptor read/64, error -110
> > > > [  268.134641] usb 1-1.3: new high-speed USB device number 11 using ci_hdrc
> > > > [  273.214691] usb 1-1.3: device descriptor read/64, error -110
> > > > [  288.404281] usb 1-1.3: device descriptor read/64, error -110
> > > > [  288.777986] usb 1-1.3: new high-speed USB device number 13 using ci_hdrc
> > > > [  293.857331] usb 1-1.3: device descriptor read/64, error -110
> > > > [  309.044488] usb 1-1.3: device descriptor read/64, error -110
> > > > [  309.414428] usb 1-1.3: new high-speed USB device number 15 using ci_hdrc
> > > > [  314.494496] usb 1-1.3: device descriptor read/64, error -110
> > > >
> > > > This goes on and on.
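As an aside, the roughly 20 second spacing between the "new high-speed
USB device" lines above matches the retry arithmetic I describe further
down (a 5 second control transfer timeout, retried 4 times). A rough
sketch of the shape of that failing attempt -- illustrative only, with
made-up names and stubs, not the actual drivers/usb/core/hub.c code:

	/* Sketch only: names, stubs and the retry count are made up. */
	#include <linux/errno.h>

	#define SKETCH_CTRL_TIMEOUT_MS	5000	/* one descriptor read */
	#define SKETCH_READ_RETRIES	4

	/* stand-in for the 64-byte device descriptor read from the log */
	static int sketch_desc_read(unsigned int timeout_ms);

	static int sketch_enumeration_attempt(void)
	{
		int i, ret = -ETIMEDOUT;	/* -110, as in the log */

		for (i = 0; i < SKETCH_READ_RETRIES; i++) {
			ret = sketch_desc_read(SKETCH_CTRL_TIMEOUT_MS);
			if (ret >= 0)
				return ret;
			/* each failed pass burns ~5s, so ~20s per attempt */
		}
		/* enumeration failed: no device exists afterwards, so
		 * there is no disconnect path that could be delayed */
		return ret;
	}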
> > > You have plenty of time between those events (seconds), so why
> > > will a 25msec sleep help out here?
> >
> > Well, these seconds are spent due to the usb device descriptor
> > command failing with a timeout (5 seconds for one usb device
> > command, multiplied by 4 retries). This time is spent inside the
> > hub_port_init() function while the hub device lock is already
> > taken, so how would it help the lock waiter threads? We need to
> > give the hub driver thread breathing time only after the hub device
> > lock is released.
>
> Ah, you are right, sorry.
>
> And can you wrap your text at 72 columns?  Those are very long
> lines :)
>
> > > > Due to the quick and endless stream of new connection change
> > > > events, the thread running the hub_event() function remains
> > > > busy processing the hub port change events and is frequently
> > > > scheduled. During this time, any other threads (either lower
> > > > priority or even equal priority) are starved of the hub device
> > > > mutex.
> > > >
> > > > Since threads waiting for the hub device mutex lock are in
> > > > uninterruptible sleep, this leads to the kernel hung task
> > > > detector triggering a hung task after 180
> > > > (hung_task_timeout_secs) seconds of threads waiting for the hub
> > > > device mutex. And since we have hung_task_panic set to 1, it
> > > > leads to a kernel panic.
> > > >
> > > > We can always reproduce this issue with a faulty usb device
> > > > that is capable of sending endless connect status change events
> > > > due to an unstable electrical connection between the usb device
> > > > and the hub port.
> > > >
> > > > > Is the system still supposed to be running properly?  Why not
> > > > > just remove the device or disable the port when detected?
> > > >
> > > > I hope you would agree that the user of an embedded Linux based
> > > > product cannot always be expected to remove the faulty usb
> > > > device before the hung task is triggered. I mean, the user
> > > > simply won't know.
> > > >
> > > > If we have to make a decision on disabling the port, we must
> > > > know for sure that the same usb device is being attempted for
> > > > enumeration. Let's say we want to abort enumeration after a
> > > > certain number of retries. But for that, how would we know that
> > > > a new port connect change event is for the same usb device that
> > > > was tried last time? Since hub_port_init() itself fails, up to
> > > > that point we have no unique information about the usb device
> > > > attached to the hub port that we could compare to tell whether
> > > > it's a retry of the same usb device or a new usb device.
> > >
> > > I agree, trying to do this type of detection would be hard, but
> > > again, as your log points out, you have seconds between events,
> > > so why is your change helping?
> > >
> > > And again, you are hurting valid devices here by slowing
> > > everything down, why not just delay when a disconnect happens?
> >
> > Again, there is no disconnection path because there was no
> > successful connection. So what's happening is, as soon as the
> > thread releases the hub device lock, it's taken again by the hub
> > driver thread within microseconds. Think of a scenario where there
> > is a kernel thread running with lower priority that requires the
> > hub device mutex, and it just won't get it because the hub driver
> > thread, running with higher priority, demands to run endlessly.
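To make that contention window concrete, here is a heavily simplified
sketch of the pattern as I understand it (the helper is made up; this
is not the actual hub_event() code, which handles many event types):

	#include <linux/delay.h>
	#include <linux/usb.h>

	/* made-up stand-in for the ~20s failing enumeration attempt */
	static void sketch_handle_connect_change(struct usb_device *hdev);

	static void sketch_hub_event(struct usb_device *hdev)
	{
		usb_lock_device(hdev);	/* held for the whole attempt */
		sketch_handle_connect_change(hdev);
		usb_unlock_device(hdev);
		/*
		 * Without a delay here, the next queued connect change
		 * event re-takes the mutex within microseconds, so a
		 * lower priority waiter may never win it. The proposed
		 * msleep(25) opens a window for such waiters.
		 */
		msleep(25);
	}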
> > > Or just allow the cpu to reschedule?  Why did you pick "25" for
> > > this magic value?
> >
> > The magic value of 25ms was chosen to introduce as small a delay
> > as possible while still preventing the hung task. We experimented
> > with higher delay values, but not with anything below 25ms. We
> > wanted to avoid a delay so short that it would effectively turn
> > into busy waiting. I am sure it can be brought down further, and
> > I'm willing to test with even smaller delays.
> >
> > Can you please briefly explain what you mean by "just allow the
> > cpu to reschedule"? Isn't that what the delay is doing?
>
> It is, but it is also delaying a specific amount of time. What
> happens if there is no contention on the lock?  You just delayed no
> matter what.
>
> How about a call to schedule()?

I can try with schedule() if it works. Will test and post results; see
the sketch after my signature for what I have in mind.

> thanks,
>
> greg k-h

Thanks!
-Sanjeev
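P.S. A sketch (untested) of the alternative I plan to try, per the
schedule() suggestion -- illustrative only, not a patch. cond_resched()
is the usual "yield if anyone else needs the CPU" idiom, and unlike an
unconditional msleep(25) it costs almost nothing when there is no
contention; whether it is enough to let a lower priority mutex waiter
in is exactly what I need to measure:

	#include <linux/sched.h>
	#include <linux/usb.h>

	static void sketch_hub_event_tail(struct usb_device *hdev)
	{
		usb_unlock_device(hdev);

		/* no fixed delay: just offer the CPU to any runnable
		 * waiter before the next queued event is handled */
		cond_resched();		/* or a plain schedule() */
	}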