Re: [PATCH 2/2] USB: hub: change the locking in hub_activate

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 1 Sep 2016 10:29:20 -0400 (EDT)

On Thu, 1 Sep 2016, Viresh Kumar wrote:

> On 31-08-16, 12:46, Alan Stern wrote:
> > On Wed, 31 Aug 2016, Viresh Kumar wrote:
> > 
> > > On 05-08-16, 11:51, Alan Stern wrote:
> > > > +++ usb-4.x/drivers/usb/core/hub.c
> > > > @@ -1052,7 +1052,7 @@ static void hub_activate(struct usb_hub
> > > >  
> > > >  	/* Continue a partial initialization */
> > > >  	if (type == HUB_INIT2 || type == HUB_INIT3) {
> > > > -		device_lock(hub->intfdev);
> > > > +		device_lock(&hdev->dev);
> > > 
> > > Hi Alan,
> > > 
> > > I have received reports of kernel crashes (NULL dereference) due to this patch
> > > in some corner cases. Note that we have backported this patch (and a few
> > > others) to the 3.10 kernel. I have attached my hub.c file as well for reference.
> > > 
> > > Here is the reported kernel OOPs:
> > > 
> > > [   19.476228] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > > [   19.476231] pgd = ffffffc00007d000
> > > [   19.476236] [00000000] *pgd=000000000e90b003, *pmd=000000000e90c003, *pte=00e00000f9000407
> > > [   19.476242] Internal error: Oops: 96000045 [#1] PREEMPT SMP
> > > [   19.476273] Modules linked in: gb_vibrator(O) gb_usb(O) gb_uart(O) gb_spi(O) gb_sdio(O) gb_raw(O) gb_pwm(O) gb_power_supply(O) gb)
> > > [   19.476279] CPU: 0 PID: 344 Comm: kworker/0:3 Tainted: G           O 3.10.97-g4b7224f-dirty #454
> > > [   19.476290] Workqueue: events hub_init_func2
> > > [   19.476293] task: ffffffc09b3560c0 ti: ffffffc09ada8000 task.ti: ffffffc09ada8000
> > > [   19.476300] PC is at __mutex_lock_slowpath+0x138/0x224
> > > [   19.476303] LR is at __mutex_lock_slowpath+0x128/0x224
> > ...
> > > [   19.476582] Call trace:
> > > [   19.476586] [<ffffffc000ccf13c>] __mutex_lock_slowpath+0x138/0x224
> > > [   19.476590] [<ffffffc000ccf254>] mutex_lock+0x2c/0x48
> > > [   19.476593] [<ffffffc0006f6eac>] hub_activate+0x50/0x4d8
> > > [   19.476596] [<ffffffc0006f7388>] hub_init_func2+0x14/0x1c
> > > [   19.476602] [<ffffffc0002387ac>] process_one_work+0x26c/0x3cc
> > > [   19.476605] [<ffffffc000239988>] worker_thread+0x208/0x358
> > > [   19.476610] [<ffffffc00023f360>] kthread+0xbc/0xc4
> > ...
> > > This happens when the device is repeatedly connected and then removed
> > > (not manually, but because of a hardware issue).
> > 
> > If I'm reading this right, it means that hub->hdev is NULL in
> 
> I am not sure I am in sync here :(
> 
> > hub_activate().
> 
> In that case we would have crashed in hub_activate() itself, wouldn't we?
> 
> The fact that the call sequence reached mutex_lock() here means that
> hub->hdev->dev was valid, at least. The mutex dev->mutex must be corrupted or
> uninitialized, or something similar.

Ah, that makes a lot more sense.  Thanks for straightening me out.  And 
some pointer embedded in the mutex must have been NULL.

>  And that's where it all went wrong. Since
> __mutex_lock_slowpath() was called, the mutex must have had a count of 0
> instead of 1 when we tried to lock it, and then we crashed inside
> __mutex_lock_slowpath(), which can only happen if the mutex was never
> initialized in the first place.
> 
> The mutex gets initialized as part of device_add() and so things can go wrong if
> device_add() was skipped here somehow.
> 
> I may be completely wrong, but that's what I read :)

Okay.  But I don't see how device_add() could have been skipped.  The
hub_init_func2() call occurs after the original HUB_INIT hub_activate()
call, which is in hub_configure() and occurs during probing.  The hub
interface doesn't get probed until the hub device is registered (the
hub interface is a child of the hub device).

Another possibility is that the mutex _was_ initialized but got 
corrupted somehow.  Of course, that kind of thing is very hard to track 
down.

On Thu, 1 Sep 2016, Vaibhav Hiremath wrote:

> I have an update on this:
>
> It seems the culprit was a bad USB port on my laptop, which caused
> continuous connect/disconnect events on both the laptop and the phone
> (Android), eventually resulting in a kernel panic.

Well, the bad port was the trigger.  But even with a bad port, the 
kernel should not panic.

> It could be memory corruption or a race somewhere (I'm not sure), where
> the device is not properly initialized by the time execution reaches
> hub_activate().
> 
> I connected the phone to another USB port and things started working as
> expected.
> 
> I can add prints to trace the execution flow; let's see what comes out
> of it.

How easily can you reproduce the problem?

Alan Stern
