Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
> 
> On 9/6/2024 4:48 AM, Thinh Nguyen wrote:
> > On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
> >> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
> >>> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
> >>>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
> >>>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
> >>>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
> >>>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
> >>>>>>>> I would like to reconfirm from our end that in our failure scenario, we
> >>>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
> >>>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
> >>>>>>>> being set.
> >>>>>>>>
> >>>>>>> If the controller is stopped, no event is generated until it's restarted
> >>>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
> >>>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
> >>>>>>> from the controller.
> >>>>>>>
> >>>>>>> If the pending_events is not set and you still see this failure, then
> >>>>>>> likely that the controller had started, and the interrupt is generated
> >>>>>>> from the controller event. This occurs along with the interrupt
> >>>>>>> generated from your connection notification from your setup.
> >>>>>> I completely agree. My discussion revolves around the handling of the
> >>>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
> >>>>>> flag is to prevent the processing of new events if an existing event is
> >>>>>> still being processed. This flag is set in the top-half interrupt
> >>>>>> handler and cleared at the end of the bottom-half handler.
> >>>>>>
> >>>>>> Now, let's consider scenarios where the bottom half is not scheduled,
> >>>>>> and a USB reconnect occurs. In this case, there is a possibility that
> >>>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
> >>>>>> controller begins posting new events. The top-half interrupt handler
> >>>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
> >>>>>> processing any new events. However, the USB controller continues to post
> >>>>>> interrupts until they are acknowledged.
> >>>>>>
> >>>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
> >>>>>>
> >>>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
> >>>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
> >>>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
> >>>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
> >>>>>> a temporary workaround in our test setup to prevent IRQ storms.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Working scenarios:
> >>>>>> ==================
> >>>>>> 1. Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>             return IRQ_HANDLED;
> >>>>>>         b. Set DWC3_EVENT_PENDING flag
> >>>>>>         c. Masking interrupt line
> >>>>>>
> >>>>>> 2. Bottom-half handler:
> >>>>>>         a. Un-masking interrupt line
> >>>>>>         b. Clear DWC3_EVENT_PENDING flag
> >>>>>>
> >>>>>> Failure scenarios:
> >>>>>> ==================
> >>>>>> 1. Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>         b. Set DWC3_EVENT_PENDING flag
> >>>>>>         c. Masking interrupt line
> >>>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
> >>>>> controller), that means that the GEVNTCOUNT was not 0 after
> >>>>> soft-disconnect and that the pm_runtime_suspended() must be false.
> >>>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
> >>>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
> >>>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
> >>>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
> >>>> that it was zero, where we received DWC3_EVENT_PENDING being set in
> >>>> non-interrupt context.
> >>> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
> >>> see it's zero, that means that it was already decremented by the driver.
> >>>
> >>> If the driver acknowledges the GEVNTCOUNT, then that means that the
> >>> events are copied and prepared to be processed. The bottom-half thread
> >>> is scheduled. If it's for stale event, I don't want it to be processed.
> >>>
> >>>>>> 2. No Bottom-half scheduled:
> >>>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
> >>>>> yet before the next top-half coming?
> >>>> In very rare cases, it is possible in our platform that the CPU may not
> >>>> be able to schedule the bottom half of the dwc3 interrupt because a work
> >>>> queue lockup has occurred on the same CPU that is attempting to schedule
> >>>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
> >>>> hasn't woken up, then initiate an IRQ storm for new events after the
> >>>> controller restarts, resulting in no more bottom-half scheduling due to
> >>>> the CPU being stuck in processing continuous interrupts and return
> >>>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
> >>>>
> >>>>>> 3. USB reconnect: dwc3_event_buffers_setup
> >>>>>>         a. Un-masking interrupt line
> >>>>> Do we know that the GEVNTCOUNT is non-zero before starting the
> >>>>> controller again?
> >>>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
> >>>> message here.
> >>>>>> 4. Continuous interrupts : Top-half handler:
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>>
> >>>>>>         a. if (evt->flags & DWC3_EVENT_PENDING)
> >>>>>>                     return IRQ_HANDLED;
> >>>>>> .....
> >>>>>>
> >>>>>> .....
> >>>>>>
> >>>>>> .....
> >>>>>>
> >>>> Sure, I can try implementing the proposed code modifications in our
> >>>> testing environment.
> >>>>
> >>>> But, I am uncertain about how these changes will effectively prevent an
> >>>> IRQ storm when the USB controller sequence restarts with the
> >>>> DWC3_EVENT_PENDING. The following code will only execute until the
> >>>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
> >>>> will not be scheduled.
> >>>>
> >>>> Please correct me if i am wrong in my above understanding.
> >>> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
> >>> the stale event. I want to ignore and skip processing any stale event.
> >>>
> >>> The DWC3_EVENT_PENDING should not be set by the time
> >>> dwc3_event_buffers_setup() is called.
> >>>
> >>> Specifically review this condition in my testing patch:
> >>>
> >>> 	/*
> >>> 	 * If the controller is halted, the event count is stale/invalid. Ignore
> >>> 	 * them. This happens if the interrupt assertion is from an out-of-band
> >>> 	 * resume notification.
> >>> 	 */
> >>> 	if (!dwc->pullups_connected && count) {
> >>> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> >>> 		return IRQ_HANDLED;
> >>> 	}
> >>>
> >>> Let me know if the condition matches with what's happening for your
> >>> case.
> >> Hi Thinh,
> >>
> >> Thanks for your continuous reviews and suggestions.
> >>
> >> The given condition also will not matches in our case.
> >> As i mentioned in starting of this thread please refer once the below
> >> link of older discussion for similar issue from Samsung..
> >>
> >> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@xxxxxxxxxxx/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$
> >>
> >>
> >> DWC3_EVENT_PENDING flags set when count is 0.
> >> It means "There are no interrupts to handle.".
> >>
> >> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
> >> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
> >> 	(void *) cache = 0xFFFFFF8839F54080,
> >> 	(unsigned int) length = 0x1000,
> >> 	(unsigned int) lpos = 0x0,
> >> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
> >> 	(dma_addr_t) dma = 0x00000008BD7D7000,
> >> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
> >> 	(u64) android_kabi_reserved1 = 0x0),
> >
> > This is the info of the event buffer that was reset after the
> > dwc3_event_buffers_setup(). I'm talking about the first time
> > DWC3_EVENT_PENDING flag was set.
> 
> Yes, the buffer that was reset before as part of 
> dwc3_event_buffers_setup() is correct.
> I agree on your new code changes in below will prevent setting 
> DWC3_EVENT_PENDING and avoid the bottom-half handler if the controller 
> is halted, and the event count is invalid.
> 
> Are you suspecting that the DWC3_EVENT_PENDING flag was only set in this 
> scenario in our failure case?

Base on the discussion so far, that's what I'm suspecting.

> 
> /*diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> index 89fc690fdf34..a525f7ea5886 100644
> --- a/drivers/usb/dwc3/gadget.c
> +++ b/drivers/usb/dwc3/gadget.c
> @@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>   		return IRQ_HANDLED;
>   
>   	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +
> +	/*
> +	 * If the controller is halted, the event count is stale/invalid. Ignore
> +	 * them. This happens if the interrupt assertion is from an out-of-band
> +	 * resume notification.
> +	 */
> +	if (!dwc->pullups_connected && count) {
> +		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
> +		return IRQ_HANDLED;
> +	}
> +
>   	count &= DWC3_GEVNTCOUNT_MASK;
>   	if (!count)
>   		return IRQ_NONE;
> 
> >
> > By the time the interrupt storm below occur, the count in the buffer is
> > already zero'ed out.
> >
> >> IRQ Storm:
> >> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
> >> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
> >> ...
> >> ...
> >> ...
> >>
> >>
> >> We are also fine with below code changes as you suggested earlier.
> >> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@xxxxxxxxxxxx/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$
> >>
> >> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> >> index 65500246323b..3c36dfdb88f0 100644
> >> --- a/drivers/usb/dwc3/gadget.c
> >> +++ b/drivers/usb/dwc3/gadget.c
> >> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct
> >> dwc3_event_buffer *evt)
> >>            * irq event handler completes before caching new event to prevent
> >>            * losing events.
> >>            */
> >> -       if (evt->flags & DWC3_EVENT_PENDING)
> >> +       if (evt->flags & DWC3_EVENT_PENDING) {
> >> +               if (!evt->count) {
> >> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
> >> +
> >> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
> >> +                               evt->flags &= ~DWC3_EVENT_PENDING;
> >> +               }
> >>                   return IRQ_HANDLED;
> >> +       }
> >>
> >>
> > I don't want the bottom-half to be scheduled in the beginning as it may
> > come before the cleanup in dwc3_event_buffers_setup().
> You mean the above changes for clearing DWC3_EVENT_PENDINGnot required 
> as you given new change will prevent setting of DWC3_EVENT_PENDING 
> before dwc3_event_buffers_setup().

Yes.

> But I dont see any harm in above code changes for clearing 
> DWC3_EVENT_PENDING if it already set with evt->count=0.

You can add it there, but do we need to if we can solve the actual
issue?

I'm interested in confirming my suspiction of what's really causing the
DWC3_EVENT_PENDING to be set here. The code logic would be clearer
rather than masking the behavior by depending on the reset by the
dwc3_event_buffers_setup(). The runtime resume doesn't share the same
locking mechanism as when processing an event. While it may be unlikely,
I don't want the bottom-half thread to handle stale event or race with
the runtime resume.

> 
> Anyway I will try the your new proposed changes alone on our testing 
> setup and will update the status,
> 

Thanks,
Thinh




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux