Re: EHCI: iaa_watchdog_start() warning followed by NULL ptr dereference in start_unlink_async()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Alan Stern,

On 9/12/2012 5:59 AM, Alan Stern wrote:
> On Tue, 11 Sep 2012, Hemant Kumar wrote:
> 
>> Hi
>>
>> I came across an issue where I see WARN_ON from iaa_watchdog_start() and
>> after almost 10ms I see NULL ptr dereference in start_unlink_async()
>>  It happens exactly here
>> 	prev = ehci->async;
>> =>	while (prev->qh_next.qh != qh)
>> 		prev = prev->qh_next.qh;
> 
>> Which looks to me that qh that we are trying to unlink is not part for the
>> async list maintained by ehci. Here is the status of ehci_hcd struct at
>> the time of crash
> 
>> This issue was reported when interface suspend happened as a result of
>> runtime suspend and our bridge driver called usb_kill_anchored_urbs().
>> Bridge driver queues 50 rx URBs when it resumes and unlinks them during
>> suspend.
>>
>> This issue is very hard to reproduce (takes around week's time to show
>> up). So I was trying to analyze it statically based on the ram dump but
>> couldn�t figure out of a code path which can show this behavior.
>>
>> Can someone please provide some pointers which can cause this issue to
>> happen or if this is something known ?
> 
> I can't tell from this what happened.  You're right that the qh being
> unlinked was not on the async list at the time.  My first thought was
> that you saw a race between ehci_iaa_watchdog() and iaa_watchdog_done()
> -- which could account for the WARN_ON -- but it's hard to see how that
> would cause the NULL dereference.
>

Thanks for your response. I and Hemant work together. We are thinking a
scenario where this NULL dereference could happen. Please let us know if
our understanding is correct or not. we are assuming that async doorbell
interrupt can not be stopped by software as mentioned at
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=6feff1b92bedab133c5835e510d11f62e843b257

1. Say an interface driver called unlink on an endpoint whose QH is QH1.
we set QH1 state to QH_STATE_UNLINK and remove it from async list; wait
for IAAD interrupt from controller.
2. Before IAAD interrupt occurs, if another function driver called
unlink on its two endpoints whose QH are QH2 and QH3 respectively. As
ehci->reclaim = QH1, we set QH2 and QH3 state as QH_STATE_UNLINK_WAIT.
We set QH1->reclaim = QH2 and QH2->reclaim = QH1. We still are waiting
for QH1 unlink completion.
3. say IAAD interrupt did not occur with in 10msec and iaa_watchdog is
fired. we call end_unlink_async() from timer function with ehci lock. In
end_unlink_async(), we set  QH1->qh_next.qh = NULL and ehci->reclaim =
QH2. next (temporary variable in this function) also points to QH2.
qh_completions() is called for QH1. If at all IAAD interrupt is fired
during (3) and waiting for ehci->lock, it get a chance when ehci lock is
dropped in qh_completions(). This is a SMP system. So timer and IRQ can
run on two different processors.
4. end_unlink_async() is called from interrupt handler. now
ehci->reclaim points to QH2. But QH2 is still in async list and its
state is QH_STATE_UNLINK_WAIT. we set QH2->qh_next.qh = NULL and
ehci->reclaim = QH3. next (temporary variable in this function) also
points to QH3. qh_completions() is called for QH2. ehci lock is dropped
in qh_completions().
5. end_unlink_async() running from timer gets ehci lock. As next points
to QH2, start_unlink_async() is called for QH2. During this function, we
set prev->qh_next = QH2->qh_next; But QH2->qh_next.qh is NULL. so
software async list is broken. At the end of end_unlink_async() we kick
iaa_watchdog timer and return.
6. Now end_unlink_async() running from interrupt handler gets ehci lock.
As next points to QH3, we call start_unlink_async(). if QH3's position
in the async list is before the __broken__ position, we don;t hit NULL
dereference. But we try to start the timer again and that gives a warning.
7. If another endpoint whose QH is QH4 unlink is called. If QH4 position
is after async list's broken position, we hit into NULL pointer deference.

If at all the above scenario is possible, we have to fix the race
between watchdog timer function and IAAD interrupt handler. In other
words, we should not call end_unlink_async() prematurely. can we simply
return from end_unlink_async if QH state != QH_STATE_UNLINK to avoid
this scenario?

> The logic for all of this code was changed completely in the 3.6
> kernel.  You might want to try running the current -rc.
> 

Unfortunately, I don't have a system running 3.6 kernel . This issue is
reported on 3.0 and 3.4 kernels. I will definitely try this test when I
get a system running 3.6 kernel and let you know the results.

Thanks,
Pavan

-- 
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux