Re: [PATCH 1/3] usb: dwc3: gadget: Prevent losing events in event cache

Felipe Balbi <balbi@xxxxxxxxxx> · Tue, 11 Apr 2017 10:44:39 +0300

Hi,

John Youn <John.Youn@xxxxxxxxxxxx> writes:
>> Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> writes:
>>> The dwc3 driver can overwite its previous events if its top half IRQ
>>> handler gets invoked again before processing the events in the cache. We
>>
>> interrupts are masked, why would top half get invoked again? Is this,
>> perhaps, related to DWC3 3.00a which has the "Interrupt line doesn't
>> lower when masked" problem? We've added a lot of code to workaround that
>> problem and, apparently, it wasn't enough.
>
> No, it is not related to that. We verified with PCIe traces. The
> interrupt line gets deasserted after we mask it. And we put the
> masking as close to the beginning of the top-half as possible.
>
>>
>> In any case, there's no way top half would be invoked again in a
>> properly working DWC3.
>
> Yet we still see it sometimes. Usually it doesn't create a problem,

that's fair, but it's not for the reason you're describing :-) There
might be another problem going on, because since we masked the interrupt
and cleared all events, IRQ shouldn't be raised at all; unless, as I
mentioned on the other subthread, the IRQ line is shared.

> but if there happens to be a new event there, we get the failure.
>
> We didn't trace into that part of the kernel so we can't explain why.
> But if there is any chance the interrupt line deassertion wasn't
> detected in time, whatever part of the kernel that thinks it is still
> asserted might just call our top-half again. This could be a totally
> wrong assumption, but it doesn't seem too far-fetched.

The kernel doesn't detect IRQ line assertion/deassertion. CPU gets an
exception when that happens and calls Kernel IRQ handler vector. That
will, in turn, figure out which line triggered, call the handler and so
on.

> If you have any insight on how to debug this, we can investigate this
> more.

see the other subthread :-)

>>> see this as a hang in the file transfer and the host will attempt to
>>> reset the device. This shouldn't be possible in the dwc3 implementation
>>> if the HW and the SW are in sync. However, through testing and reading
>>
>> it's not about SW and HW being in sync, it's about the HW being quirky,
>> right?
>
> No, we don't see any problem in the hardware. Here, being out of
> "sync" just means the device hangs because of lost events. There are a
> number of pretty bad failure scenarios that could happen due to this.
>
>>
>>> the pcie trace, the top half IRQ handler occasionally still gets invoked
>>> one more time after HW interrupt deassertion. We suspect that there is a
>>> small detection delay in the SW.
>>
>> so you don't really know what's going on :-) Why send a patch if you
>> don't understand the problem fully?
>>
>>> Top half IRQ handler deasserts the interrupt line after it gets the
>>> event count by writing DWC3_GEVNTSIZ_INTMASK to DWC3_GEVNTSIZ. There's a
>>> small window for a new event coming in between reading the event count
>>> and the interrupt deassertion. More generally, we will see 0 event
>>> count, which should not affect anything.
>>
>> this new event can come, and it'll remain in the event buffer. We have
>> masked the IRQ line and this should be enough to make sure IRQ isn't
>> generated again. Why is the IRQ raising again? Which kernel version are
>> you testing? Which DWC3 revision? Do you have interrupt moderation?
>
> It's not about the new event. We see the case where the top-half is
> called again just after we mask the interrupt without any events too.
> That's probably the majority of the time it occurs. It's just that if
> this case happens to coincide with a new event, everything gets
> screwed up.
>
> An alternative fix is to not allow the top-half to overwrite the event
> cache. Just append and update it. The top and bottom are already
> mutually exclusive so it should be fine.

Right, we _can_ do that as further protection for the problem. But first
we need to really _KNOW_ why this is happening. Then provide a minimal
fix which can be backported to stable and only after the minimal fix, do
we improve the code (perhaps even add comments in the code explaining
this situation so we never lose track of it again) with further
protections.

-- 
balbi
Attachment:
signature.asc

Description: PGP signature