Re: [PATCH 1/3] usb: dwc3: gadget: Prevent losing events in event cache

Bjorn Helgaas <bhelgaas@xxxxxxxxxx> · Wed, 12 Apr 2017 10:04:58 -0500

On Wed, Apr 12, 2017 at 1:13 AM, Felipe Balbi <balbi@xxxxxxxxxx> wrote:
>
> Hi,
>
> John Youn <John.Youn@xxxxxxxxxxxx> writes:
>>>> John Youn <John.Youn@xxxxxxxxxxxx> writes:
>>>>>> Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> writes:
>>>>>>> The dwc3 driver can overwrite its previous events if its top half IRQ
>>>>>>> handler gets invoked again before processing the events in the cache. We
>>>>>>
>>>>>> interrupts are masked, why would the top half get invoked again? Is this,
>>>>>> perhaps, related to DWC3 3.00a, which has the "Interrupt line doesn't
>>>>>> lower when masked" problem? We've added a lot of code to work around that
>>>>>> problem and, apparently, it wasn't enough.
>>>>>
>>>>> No, it is not related to that. We verified with PCIe traces. The
>>>>> interrupt line gets deasserted after we mask it. And we put the
>>>>> masking as close to the beginning of the top-half as possible.
>>>>>
>>>>>>
>>>>>> In any case, there's no way top half would be invoked again in a
>>>>>> properly working DWC3.
>>>>>
>>>>> Yet we still see it sometimes. Usually it doesn't create a problem,
>>>>
>>>> that's fair, but it's not for the reason you're describing :-) There
>>>> might be another problem going on: since we masked the interrupt and
>>>> cleared all events, the IRQ shouldn't be raised at all, unless, as I
>>>> mentioned on the other subthread, the IRQ line is shared.
>>>>
>>>>> but if there happens to be a new event there, we get the failure.
>>>>>
>>>>> We didn't trace into that part of the kernel, so we can't explain why.
>>>>> But if there is any chance the interrupt line deassertion wasn't
>>>>> detected in time, whatever part of the kernel that thinks it is still
>>>>> asserted might just call our top-half again. This could be a totally
>>>>> wrong assumption, but it doesn't seem too far-fetched.
>>>>
>>>> The kernel doesn't detect IRQ line assertion/deassertion. The CPU gets
>>>> an exception when that happens and calls the kernel's IRQ handler
>>>> vector. That, in turn, figures out which line triggered, calls the
>>>> handler, and so on.
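A minimal userspace sketch of the shared-line case raised above, assuming a simplified model rather than the kernel's real dispatch code: when a shared line fires, every handler registered on it is called, so each one must check its own device and report "not mine" when nothing is pending. The dwc3_check_event_buf() code quoted further down does this by returning IRQ_NONE when GEVNTCOUNT reads zero. The names shared_dev, sh_irqreturn, shared_handler() and dispatch_shared() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for IRQ_NONE / IRQ_HANDLED (not the kernel's). */
enum sh_irqreturn { SH_IRQ_NONE, SH_IRQ_HANDLED };

/* Hypothetical device that may or may not be driving the shared line. */
struct shared_dev {
    bool raised;       /* this device has something pending */
    unsigned handled;  /* how many interrupts it has serviced */
};

/* Per-device handler: claim the interrupt only if it is ours. */
static enum sh_irqreturn shared_handler(struct shared_dev *dev)
{
    if (!dev->raised)
        return SH_IRQ_NONE;   /* another device on the line fired */
    dev->raised = false;      /* acknowledge: stop driving the line */
    dev->handled++;
    return SH_IRQ_HANDLED;
}

/* When the shared line fires, call every registered handler and
 * report whether any of them claimed the interrupt. */
static bool dispatch_shared(struct shared_dev *devs, size_t n)
{
    bool claimed = false;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (shared_handler(&devs[i]) == SH_IRQ_HANDLED)
            claimed = true;
    return claimed;
}
```

In this model, the device that has nothing pending is still called on every firing of the line and simply declines, which is the situation a shared line would create for dwc3.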
>>>
>>> We're talking about PCIe though, where interrupt assertion and
>>> deassertion are packets. So I would imagine the kernel has to do
>>> something and there could be some latency associated with that.
>>
>> Another thing is that the device uses legacy (INTx) level-triggered
>> PCIe interrupts, so for as long as the interrupt is asserted, the TH
>> is called repeatedly.
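To illustrate the level-triggered behavior described above, a toy model, assuming an idealized line whose level is recomputed instantly from device state (so it deliberately leaves out the PCIe message latency being debated). The dispatcher keeps calling the top half while the line is asserted: a handler that masks and acknowledges everything stops it after one call, one that acknowledges a single event per call runs once per event, and one that does nothing only stops at the safety bound. All names (lvl_dev, lvl_dispatch(), th_*) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device behind a level-triggered line. */
struct lvl_dev {
    unsigned pending;  /* events not yet acknowledged */
    bool masked;       /* interrupt generation masked */
};

/* Idealized line: asserted while unmasked events remain. */
static bool lvl_line_asserted(const struct lvl_dev *d)
{
    return d->pending > 0 && !d->masked;
}

/* Call the top half for as long as the line stays asserted, giving
 * up after max_calls so a handler that never lowers the line cannot
 * hang the simulation. Returns the number of calls made. */
static unsigned lvl_dispatch(struct lvl_dev *d,
                             void (*top_half)(struct lvl_dev *),
                             unsigned max_calls)
{
    unsigned calls = 0;

    while (lvl_line_asserted(d) && calls < max_calls) {
        top_half(d);
        calls++;
    }
    return calls;
}

/* Masks and acknowledges everything: the line drops after one call. */
static void th_mask_and_ack(struct lvl_dev *d)
{
    d->masked = true;
    d->pending = 0;
}

/* Acknowledges one event per call and never masks. */
static void th_ack_one(struct lvl_dev *d)
{
    assert(d->pending > 0);
    d->pending--;
}

/* Does nothing: the line stays asserted forever. */
static void th_ignore(struct lvl_dev *d)
{
    (void)d;
}
```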
>
> yes, and that's why we have:
>
>> static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>> {
>>       struct dwc3 *dwc = evt->dwc;
>>       u32 amount;
>>       u32 count;
>>       u32 reg;
>
>>       if (pm_runtime_suspended(dwc->dev)) {
>>               pm_runtime_get(dwc->dev);
>>               disable_irq_nosync(dwc->irq_gadget);
>>               dwc->pending_events = true;
>>               return IRQ_HANDLED;
>>       }
>>
>>       count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>>       count &= DWC3_GEVNTCOUNT_MASK;
>
> check how many events are pending in the event buffer.
>
>>       if (!count)
>>               return IRQ_NONE;
>>
>>       evt->count = count;
>>       evt->flags |= DWC3_EVENT_PENDING;
>>
>>       /* Mask interrupt */
>>       reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
>>       reg |= DWC3_GEVNTSIZ_INTMASK;
>
> mask interrupt generation
>
>>       dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), reg);
>>
>>       amount = min(count, evt->length - evt->lpos);
>>       memcpy(evt->cache + evt->lpos, evt->buf + evt->lpos, amount);
>>
>>       if (amount < count)
>>               memcpy(evt->cache, evt->buf, count - amount);
>>
>>       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>
> clear ALL events from the event buffer. This brings the line down, so
> we shouldn't re-enter.
>
>>       return IRQ_WAKE_THREAD;
>> }
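The two memcpy() calls above copy a run of count bytes out of a circular event buffer that may wrap past its end. A standalone sketch of the same offset arithmetic, with the hypothetical helper ring_copy() standing in for that part of the handler: the tail segment lands at offset lpos in the cache and any wrapped remainder lands at offset 0. Assuming, as in the quoted code, that lpos does not move until the bottom half consumes the events, a second top-half pass before then writes over the same cache offsets, which is the event loss named in the patch subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `count` bytes starting at offset `lpos` of a ring buffer of
 * `length` bytes into `cache`, using the same offsets as the quoted
 * handler: the part up to the end of the ring goes to cache + lpos,
 * and any wrapped part goes to the start of cache. */
static void ring_copy(unsigned char *cache, const unsigned char *buf,
                      size_t length, size_t lpos, size_t count)
{
    size_t room = length - lpos;               /* bytes before the wrap */
    size_t amount = count < room ? count : room;

    assert(lpos < length && count <= length);
    memcpy(cache + lpos, buf + lpos, amount);
    if (amount < count)                        /* run wraps past the end */
        memcpy(cache, buf, count - amount);
}
```

For example, with an 8-byte ring, lpos = 6 and count = 4, bytes 6 and 7 are copied in place and bytes 0 and 1 are copied from the start of the ring; the rest of the cache is untouched.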
>
>> So we mask the interrupt in the TH, and a short time later the
>> interrupt deassertion packet is sent on the PCIe bus. If that packet
>> isn't seen right away, the TH may be called again before the BH
>> gets scheduled.
>
> not sure this can happen. If that's the case, every PCI driver would
> have all sorts of tricks to cope with this, not only dwc3 :-)
>
> Bjorn, is this something that can happen on PCIe?

I'm pretty sure that the device will send Deassert_INTx *after* the
driver tells the device to stop interrupting.  That should eventually
result in deassertion of the level-triggered interrupt.
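One way to picture the hypothesis in this thread, as a toy model only: the device drops its line as soon as the top half masks it, but the interrupt controller learns of level changes a fixed number of ticks late. Under that assumption the top half can run several times on stale "asserted" state before the deassertion is observed, even though the device stopped interrupting after the first call. The model ignores any masking the interrupt controller itself performs while a handler runs, and the tick-based delay, HORIZON and count_top_half_calls() are illustrative assumptions, not measured PCIe behavior.

```c
#include <assert.h>
#include <stdbool.h>

#define HORIZON 64  /* number of ticks simulated (arbitrary) */

/* Toy model only: the device asserts its line at tick 0 and drops it
 * the moment the top half runs, but the interrupt controller sees
 * each tick's level `delay` ticks late. Returns how many times the
 * top half is invoked within HORIZON ticks. */
static int count_top_half_calls(int delay)
{
    bool level[HORIZON];  /* device-side line level at each tick */
    int calls = 0;

    assert(delay >= 0 && delay < HORIZON);
    for (int t = 0; t < HORIZON; t++)
        level[t] = true;  /* asserted until the top half masks it */

    for (int t = 0; t < HORIZON; t++) {
        /* The controller acts on the level from `delay` ticks ago. */
        bool seen = t >= delay && level[t - delay];

        if (!seen)
            continue;
        calls++;
        /* Top half masks the device: line is low from now on. */
        for (int u = t; u < HORIZON; u++)
            level[u] = false;
    }
    return calls;
}
```

In this model a delay of 0 or 1 tick gives a single call, while a delay of 3 ticks gives 3 calls; every call after the first finds the device already masked and acknowledged, so the handler has to cope with finding nothing to do.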

I can't speak to the mechanics of interrupt masking and the IRQ
subsystem.  I do think all IRQ handlers should be prepared to handle
spurious interrupts gracefully.
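A sketch of a top half that tolerates such calls, written as userspace C with hypothetical stand-in types (toy_evt, toy_irqreturn, toy_top_half(), toy_bottom_half()) rather than the driver's own. It returns a "none" value when no events are counted, and it refuses to overwrite a cache the bottom half has not consumed yet, leaving new events counted in the ring for a later pass. The pending flag plays the role of DWC3_EVENT_PENDING in the quoted code; this is one possible guard, not necessarily what the patch under discussion does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EVT_LEN 16  /* hypothetical ring size in bytes */

/* Userspace stand-ins for the kernel's irqreturn_t values. */
enum toy_irqreturn { TOY_IRQ_NONE, TOY_IRQ_HANDLED, TOY_IRQ_WAKE_THREAD };

/* Simplified event buffer loosely modelled on the quoted code. */
struct toy_evt {
    unsigned char buf[EVT_LEN];   /* ring written by the "device" */
    unsigned char cache[EVT_LEN]; /* snapshot handed to the bottom half */
    size_t hw_count;              /* stand-in for GEVNTCOUNT */
    size_t count;                 /* bytes cached for the bottom half */
    size_t lpos;                  /* next ring offset to consume */
    bool pending;                 /* stand-in for DWC3_EVENT_PENDING */
};

static enum toy_irqreturn toy_top_half(struct toy_evt *evt)
{
    size_t room, amount;

    /* Re-entered before the bottom half consumed the cache: leave the
     * cache alone; unacknowledged events stay counted in hw_count. */
    if (evt->pending)
        return TOY_IRQ_HANDLED;

    assert(evt->hw_count <= EVT_LEN);
    if (evt->hw_count == 0)
        return TOY_IRQ_NONE;      /* spurious call: nothing pending */

    evt->count = evt->hw_count;
    evt->pending = true;

    /* Snapshot the ring, handling wrap-around as the quoted code does. */
    room = EVT_LEN - evt->lpos;
    amount = evt->count < room ? evt->count : room;
    memcpy(evt->cache + evt->lpos, evt->buf + evt->lpos, amount);
    if (amount < evt->count)
        memcpy(evt->cache, evt->buf, evt->count - amount);

    evt->hw_count -= evt->count;  /* acknowledge what was cached */
    return TOY_IRQ_WAKE_THREAD;
}

/* Bottom half: consume the cached bytes, then allow new snapshots. */
static void toy_bottom_half(struct toy_evt *evt)
{
    evt->lpos = (evt->lpos + evt->count) % EVT_LEN;
    evt->count = 0;
    evt->pending = false;
}
```

The toy does not model interrupt masking; in a driver that masks in the top half, the bottom half would also need to unmask afterwards so that any leftover events raise the line again.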

> Quick summary of the problem:
>
> John and Thinh are experiencing a re-entrant top-half handler even
> though we have cleared the pending IRQ status _and_ masked interrupts.
> SNPS is using an FPGA model of the latest DWC3 core under x86.
>
> I have never seen this behavior on ARM or any of the x86 devices
> containing this core (and this includes all the newest x86 cores, see
> drivers/usb/dwc3/dwc3-pci.c for PCI IDs if you care enough :-)
>
> Anyway, from my point of view, this is either a bug in IRQ subsystem
> which only John and Thinh can reproduce at this moment, or a regression
> with DWC3 IP Core :-s
>
> --
> balbi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html