Re: [PATCH 1/3] usb: dwc3: gadget: Prevent losing events in event cache

Felipe Balbi <balbi@xxxxxxxxxx> · Wed, 12 Apr 2017 09:13:36 +0300

Hi,

John Youn <John.Youn@xxxxxxxxxxxx> writes:
>>> John Youn <John.Youn@xxxxxxxxxxxx> writes:
>>>>> Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> writes:
>>>>>> The dwc3 driver can overwrite its previous events if its top half IRQ
>>>>>> handler gets invoked again before processing the events in the cache. We
>>>>>
>>>>> interrupts are masked, why would top half get invoked again? Is this,
>>>>> perhaps, related to DWC3 3.00a which has the "Interrupt line doesn't
>>>>> lower when masked" problem? We've added a lot of code to workaround that
>>>>> problem and, apparently, it wasn't enough.
>>>>
>>>> No, it is not related to that. We verified with PCIe traces. The
>>>> interrupt line gets deasserted after we mask it. And we put the
>>>> masking as close to the beginning of the top-half as possible.
>>>>
>>>>>
>>>>> In any case, there's no way top half would be invoked again in a
>>>>> properly working DWC3.
>>>>
>>>> Yet we still see it sometimes. Usually it doesn't create a problem,
>>>
>>> that's fair, but it's not for the reason you're describing :-) There
>>> might be another problem going on, because since we masked the interrupt
>>> and cleared all events, IRQ shouldn't be raised at all; unless, as I
>>> mentioned on the other subthread, the IRQ line is shared.
>>>
>>>> but if there happens to be a new event there, we get the failure.
>>>>
>>>> We didn't trace into that part of the kernel so we can't explain why.
>>>> But if there is any chance the interrupt line deassertion wasn't
>>>> detected in time, whatever part of the kernel that thinks it is still
>>>> asserted might just call our top-half again. This could be a totally
>>>> wrong assumption, but it doesn't seem too far-fetched.
>>>
>>> The kernel doesn't detect IRQ line assertion/deassertion. CPU gets an
>>> exception when that happens and calls Kernel IRQ handler vector. That
>>> will, in turn, figure out which line triggered, call the handler and so
>>> on.
>>
>> We're talking about PCIe though, where interrupt assertion and
>> deassertion are packets. So I would imagine the kernel has to do
>> something and there could be some latency associated with that.
>
> Also, another thing is that the device uses legacy, level-triggered,
> PCIe interrupts, so for as long as the interrupt is asserted, the TH
> is called repeatedly.

yes, and that's why we have:

> static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
> {
> 	struct dwc3 *dwc = evt->dwc;
> 	u32 amount;
> 	u32 count;
> 	u32 reg;

> 	if (pm_runtime_suspended(dwc->dev)) {
> 		pm_runtime_get(dwc->dev);
> 		disable_irq_nosync(dwc->irq_gadget);
> 		dwc->pending_events = true;
> 		return IRQ_HANDLED;
> 	}
>
> 	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> 	count &= DWC3_GEVNTCOUNT_MASK;

check how many events are pending in the event buffer.

> 	if (!count)
> 		return IRQ_NONE;
>
> 	evt->count = count;
> 	evt->flags |= DWC3_EVENT_PENDING;
>
> 	/* Mask interrupt */
> 	reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
> 	reg |= DWC3_GEVNTSIZ_INTMASK;

mask interrupt generation

> 	dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), reg);
>
> 	amount = min(count, evt->length - evt->lpos);
> 	memcpy(evt->cache + evt->lpos, evt->buf + evt->lpos, amount);
>
> 	if (amount < count)
> 		memcpy(evt->cache, evt->buf, count - amount);
>
> 	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);

clear ALL events from event buffer. This brings the line down, so we
shouldn't re-enter.

> 	return IRQ_WAKE_THREAD;
> }
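
For anyone following along, the two memcpy() calls above are a ring-buffer
copy that wraps at the end of the event buffer. Below is a minimal
user-space sketch of just that step. The function name and parameters are
hypothetical and only mirror evt->cache, evt->buf, evt->length and
evt->lpos; it is not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `count` bytes starting at offset `lpos` of the ring `buf` into the
 * same offsets of `cache`, wrapping around at `length`.
 * Hypothetical helper for illustration only.
 */
static void ring_copy_to_cache(unsigned char *cache, const unsigned char *buf,
			       size_t length, size_t lpos, size_t count)
{
	size_t amount;

	/* The read position must be inside the ring, and the pending
	 * byte count cannot exceed the ring size. */
	assert(lpos < length && count <= length);

	/* First chunk: from lpos up to the end of the ring (or fewer). */
	amount = count < length - lpos ? count : length - lpos;
	memcpy(cache + lpos, buf + lpos, amount);

	/* Second chunk: whatever wrapped around to the start of the ring. */
	if (amount < count)
		memcpy(cache, buf, count - amount);
}
```

With an 8-byte ring, lpos = 6 and count = 4, the first memcpy() covers
offsets 6..7 and the second covers offsets 0..1; the rest of the cache is
left untouched.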

> So we mask the interrupt in the TH and a short time later, the
> interrupt de-assertion packet is sent on PCIe bus and if that's not
> seen right away we may already have another call to TH before the BH
> gets scheduled.

not sure this can happen. If that's the case, every PCI driver would
have all sorts of tricks to cope with this, not only dwc3 :-)

Bjorn, is this something that can happen on PCIe?

Quick summary of the problem:

John and Thinh are experiencing a re-entrant top-half handler even
though we have cleared pending IRQ status _and_ masked interrupts. SNPS
is using an FPGA model of the latest DWC3 core under x86.

I have never seen this behavior on ARM or any of the x86 devices
containing this core (and this includes all the newest x86 cores, see
drivers/usb/dwc3/dwc3-pci.c for PCI IDs if you care enough :-).

Anyway, from my point of view, this is either a bug in the IRQ subsystem
which only John and Thinh can reproduce at this moment, or a regression
in the DWC3 IP core :-s
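
To make the reported failure concrete, here is a rough user-space model of
what a second top-half call would do to the cached count if it ran before
the bottom half. Everything in it is an assumption for illustration:
model_evt, hw_count (standing in for GEVNTCOUNT) and the `guard` flag are
made-up names, and the pending-flag check is just one possible guard, not
necessarily what the patch under review does.

```c
#include <assert.h>
#include <string.h>

#define BUF_LEN       16u   /* hypothetical event buffer size, in bytes */
#define EVENT_PENDING 0x1u  /* stands in for DWC3_EVENT_PENDING */

struct model_evt {
	unsigned char buf[BUF_LEN];   /* ring written by the "hardware" */
	unsigned char cache[BUF_LEN]; /* copy consumed by the bottom half */
	unsigned int lpos;            /* next read position */
	unsigned int count;           /* bytes cached for the bottom half */
	unsigned int flags;
	unsigned int hw_count;        /* stands in for GEVNTCOUNT */
};

enum model_ret { MODEL_NONE, MODEL_HANDLED, MODEL_WAKE_THREAD };

/* Simplified top half; `guard` enables the pending-flag check. */
static enum model_ret model_top_half(struct model_evt *evt, int guard)
{
	unsigned int count, amount;

	/* Possible guard: bottom half has not consumed the cache yet. */
	if (guard && (evt->flags & EVENT_PENDING))
		return MODEL_HANDLED;

	count = evt->hw_count;
	if (!count)
		return MODEL_NONE;
	assert(count <= BUF_LEN);

	evt->count = count;           /* overwrites any earlier count */
	evt->flags |= EVENT_PENDING;

	amount = count < BUF_LEN - evt->lpos ? count : BUF_LEN - evt->lpos;
	memcpy(evt->cache + evt->lpos, evt->buf + evt->lpos, amount);
	if (amount < count)
		memcpy(evt->cache, evt->buf, count - amount);

	evt->hw_count = 0;            /* acknowledge everything we cached */
	return MODEL_WAKE_THREAD;
}

/* Two top-half calls with no bottom half in between. */
static unsigned int run_scenario(int guard)
{
	struct model_evt evt;

	memset(&evt, 0, sizeof(evt));
	evt.hw_count = 8;             /* two 4-byte events arrive */
	model_top_half(&evt, guard);
	evt.hw_count = 4;             /* one more event arrives */
	model_top_half(&evt, guard);  /* unexpected second call */
	return evt.count;             /* what the bottom half would consume */
}
```

In this model run_scenario(0) returns 4: the second call replaced the
cached count of 8, so the bottom half would only consume 4 bytes and lose
track of the rest. run_scenario(1) returns 8, and the newer event simply
stays pending in hw_count for a later interrupt.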

-- 
balbi