Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/6/2024 6:29 AM, Thinh Nguyen wrote:
> On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
>> On 9/6/2024 4:48 AM, Thinh Nguyen wrote:
>>> On Fri, Sep 06, 2024, Selvarasu Ganesan wrote:
>>>> On 9/6/2024 2:43 AM, Thinh Nguyen wrote:
>>>>> On Thu, Sep 05, 2024, Selvarasu Ganesan wrote:
>>>>>> On 9/5/2024 5:56 AM, Thinh Nguyen wrote:
>>>>>>> On Wed, Sep 04, 2024, Selvarasu Ganesan wrote:
>>>>>>>> On 9/4/2024 6:33 AM, Thinh Nguyen wrote:
>>>>>>>>> On Mon, Sep 02, 2024, Selvarasu Ganesan wrote:
>>>>>>>>>> I would like to reconfirm from our end that in our failure scenario, we
>>>>>>>>>> observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
>>>>>>>>>> resume sequence is executed, and the dwc->pending_events flag is not
>>>>>>>>>> being set.
>>>>>>>>>>
>>>>>>>>> If the controller is stopped, no event is generated until it's restarted
>>>>>>>>> again. (ie, you should not see GEVNTCOUNT updated after clearing
>>>>>>>>> DCTL.run_stop). If there's no event, no interrupt assertion should come
>>>>>>>>> from the controller.
>>>>>>>>>
>>>>>>>>> If the pending_events is not set and you still see this failure, then
>>>>>>>>> likely that the controller had started, and the interrupt is generated
>>>>>>>>> from the controller event. This occurs along with the interrupt
>>>>>>>>> generated from your connection notification from your setup.
>>>>>>>> I completely agree. My discussion revolves around the handling of the
>>>>>>>> DWC3_EVENT_PENDING flag in all situations. The purpose of using this
>>>>>>>> flag is to prevent the processing of new events if an existing event is
>>>>>>>> still being processed. This flag is set in the top-half interrupt
>>>>>>>> handler and cleared at the end of the bottom-half handler.
>>>>>>>>
>>>>>>>> Now, let's consider scenarios where the bottom half is not scheduled,
>>>>>>>> and a USB reconnect occurs. In this case, there is a possibility that
>>>>>>>> the interrupt line is unmasked in dwc3_event_buffers_setup, and the USB
>>>>>>>> controller begins posting new events. The top-half interrupt handler
>>>>>>>> checks for the DWC3_EVENT_PENDING flag and returns IRQ_HANDLED without
>>>>>>>> processing any new events. However, the USB controller continues to post
>>>>>>>> interrupts until they are acknowledged.
>>>>>>>>
>>>>>>>> Please review the complete sequence once with DWC3_EVENT_PENDING flag.
>>>>>>>>
>>>>>>>> My proposal is to clear or reset the DWC3_EVENT_PENDING flag when
>>>>>>>> unmasking the interrupt line dwc3_event_buffers_setup, apart from
>>>>>>>> bottom-half handler. Clearing the DWC3_EVENT_PENDING flag in
>>>>>>>> dwc3_event_buffers_setup does not cause any harm, as we have implemented
>>>>>>>> a temporary workaround in our test setup to prevent IRQ storms.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Working scenarios:
>>>>>>>> ==================
>>>>>>>> 1. Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>              return IRQ_HANDLED;
>>>>>>>>          b. Set DWC3_EVENT_PENDING flag
>>>>>>>>          c. Masking interrupt line
>>>>>>>>
>>>>>>>> 2. Bottom-half handler:
>>>>>>>>          a. Un-masking interrupt line
>>>>>>>>          b. Clear DWC3_EVENT_PENDING flag
>>>>>>>>
>>>>>>>> Failure scenarios:
>>>>>>>> ==================
>>>>>>>> 1. Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>          b. Set DWC3_EVENT_PENDING flag
>>>>>>>>          c. Masking interrupt line
>>>>>>> For DWC3_EVENT_PENDING flag to be set at this point (before we start the
>>>>>>> controller), that means that the GEVNTCOUNT was not 0 after
>>>>>>> soft-disconnect and that the pm_runtime_suspended() must be false.
>>>>>> In the top-half code where we set the DWC3_EVENT_PENDING flag, we
>>>>>> acknowledge GEVNTCOUNT. Therefore, I think it is not necessary for
>>>>>> GEVNTCOUNT to have a non-zero value until a new event occurs. In fact,
>>>>>> when we tried to print GEVNTCOUNT in a non-interrupt context, we found
>>>>>> that it was zero, where we received DWC3_EVENT_PENDING being set in
>>>>>> non-interrupt context.
>>>>> For DWC3_EVENT_PENDING to be set, GEVNTCOUNT must be non-zero. If you
>>>>> see it's zero, that means that it was already decremented by the driver.
>>>>>
>>>>> If the driver acknowledges the GEVNTCOUNT, then that means that the
>>>>> events are copied and prepared to be processed. The bottom-half thread
>>>>> is scheduled. If it's for stale event, I don't want it to be processed.
>>>>>
>>>>>>>> 2. No Bottom-half scheduled:
>>>>>>> Why is the bottom-half not scheduled? Or do you mean it hasn't woken up
>>>>>>> yet before the next top-half coming?
>>>>>> In very rare cases, it is possible in our platform that the CPU may not
>>>>>> be able to schedule the bottom half of the dwc3 interrupt because a work
>>>>>> queue lockup has occurred on the same CPU that is attempting to schedule
>>>>>> the dwc3 thread interrupt. In this case Yes, the bottom-half handler
>>>>>> hasn't woken up, then initiate an IRQ storm for new events after the
>>>>>> controller restarts, resulting in no more bottom-half scheduling due to
>>>>>> the CPU being stuck in processing continuous interrupts and return
>>>>>> IRQ_HANDLED by checking if (evt->flags & DWC3_EVENT_PENDING).
>>>>>>
>>>>>>>> 3. USB reconnect: dwc3_event_buffers_setup
>>>>>>>>          a. Un-masking interrupt line
>>>>>>> Do we know that the GEVNTCOUNT is non-zero before starting the
>>>>>>> controller again?
>>>>>> The GEVNTCOUNT value showing as zero that we confirmed by adding debug
>>>>>> message here.
>>>>>>>> 4. Continuous interrupts : Top-half handler:
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>>
>>>>>>>>          a. if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>>                      return IRQ_HANDLED;
>>>>>>>> .....
>>>>>>>>
>>>>>>>> .....
>>>>>>>>
>>>>>>>> .....
>>>>>>>>
>>>>>> Sure, I can try implementing the proposed code modifications in our
>>>>>> testing environment.
>>>>>>
>>>>>> But, I am uncertain about how these changes will effectively prevent an
>>>>>> IRQ storm when the USB controller sequence restarts with the
>>>>>> DWC3_EVENT_PENDING. The following code will only execute until the
>>>>>> DWC3_EVENT_PENDING is cleared, at which point the previous bottom-half
>>>>>> will not be scheduled.
>>>>>>
>>>>>> Please correct me if i am wrong in my above understanding.
>>>>> As I mentioned, I don't want DWC3_EVENT_PENDING flag to be set due to
>>>>> the stale event. I want to ignore and skip processing any stale event.
>>>>>
>>>>> The DWC3_EVENT_PENDING should not be set by the time
>>>>> dwc3_event_buffers_setup() is called.
>>>>>
>>>>> Specifically review this condition in my testing patch:
>>>>>
>>>>> 	/*
>>>>> 	 * If the controller is halted, the event count is stale/invalid. Ignore
>>>>> 	 * them. This happens if the interrupt assertion is from an out-of-band
>>>>> 	 * resume notification.
>>>>> 	 */
>>>>> 	if (!dwc->pullups_connected && count) {
>>>>> 		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>>>>> 		return IRQ_HANDLED;
>>>>> 	}
>>>>>
>>>>> Let me know if the condition matches with what's happening for your
>>>>> case.
>>>> Hi Thinh,
>>>>
>>>> Thanks for your continuous reviews and suggestions.
>>>>
>>>> The given condition also will not matches in our case.
>>>> As i mentioned in starting of this thread please refer once the below
>>>> link of older discussion for similar issue from Samsung..
>>>>
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230102050831.105499-1-jh0801.jung@xxxxxxxxxxx/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynpe7Ex6cU$
>>>>
>>>>
>>>> DWC3_EVENT_PENDING flags set when count is 0.
>>>> It means "There are no interrupts to handle.".
>>>>
>>>> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF883DBF1180 (
>>>> 	(void *) buf = 0xFFFFFFC00DBDD000 = end+0x337D000,
>>>> 	(void *) cache = 0xFFFFFF8839F54080,
>>>> 	(unsigned int) length = 0x1000,
>>>> 	(unsigned int) lpos = 0x0,
>>>> *(unsigned int) count = 0x0, (unsigned int) flags = 0x00000001,*
>>>> 	(dma_addr_t) dma = 0x00000008BD7D7000,
>>>> 	(struct dwc3 *) dwc = 0xFFFFFF8839CBC880,
>>>> 	(u64) android_kabi_reserved1 = 0x0),
>>> This is the info of the event buffer that was reset after the
>>> dwc3_event_buffers_setup(). I'm talking about the first time
>>> DWC3_EVENT_PENDING flag was set.
>> Yes, the buffer that was reset before as part of
>> dwc3_event_buffers_setup() is correct.
>> I agree on your new code changes in below will prevent setting
>> DWC3_EVENT_PENDING and avoid the bottom-half handler if the controller
>> is halted, and the event count is invalid.
>>
>> Are you suspecting that the DWC3_EVENT_PENDING flag was only set in this
>> scenario in our failure case?
> Base on the discussion so far, that's what I'm suspecting.
>
>> /*diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 89fc690fdf34..a525f7ea5886 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -4490,6 +4490,17 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
>>    		return IRQ_HANDLED;
>>    
>>    	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>> +
>> +	/*
>> +	 * If the controller is halted, the event count is stale/invalid. Ignore
>> +	 * them. This happens if the interrupt assertion is from an out-of-band
>> +	 * resume notification.
>> +	 */
>> +	if (!dwc->pullups_connected && count) {
>> +		dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
>> +		return IRQ_HANDLED;
>> +	}
>> +
>>    	count &= DWC3_GEVNTCOUNT_MASK;
>>    	if (!count)
>>    		return IRQ_NONE;
>>
>>> By the time the interrupt storm below occur, the count in the buffer is
>>> already zero'ed out.
>>>
>>>> IRQ Storm:
>>>> (time = 47557628930999, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628931268, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> (time = 47557628932383, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628932652, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> (time = 47557628933768, irq = 165, fn = dwc3_interrupt, latency = 0, en = 1),
>>>> (time = 47557628934037, irq = 165, fn = dwc3_interrupt, latency = 0, en = 3),
>>>> ...
>>>> ...
>>>> ...
>>>>
>>>>
>>>> We are also fine with below code changes as you suggested earlier.
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20230109190914.3blihjfjdcszazdd@xxxxxxxxxxxx/__;!!A4F2R9G_pg!a3VpPHvMr9enk0YPjSoWJ12Kr5Hw2Ka43Q_wi80lw6ty2tJT4hKRKsCnQNdqbVS3JORK2VwqdoXDWz1q8ynp367zvEw$
>>>>
>>>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>>>> index 65500246323b..3c36dfdb88f0 100644
>>>> --- a/drivers/usb/dwc3/gadget.c
>>>> +++ b/drivers/usb/dwc3/gadget.c
>>>> @@ -5515,8 +5515,15 @@ static irqreturn_t dwc3_check_event_buf(struct
>>>> dwc3_event_buffer *evt)
>>>>             * irq event handler completes before caching new event to prevent
>>>>             * losing events.
>>>>             */
>>>> -       if (evt->flags & DWC3_EVENT_PENDING)
>>>> +       if (evt->flags & DWC3_EVENT_PENDING) {
>>>> +               if (!evt->count) {
>>>> +                       u32 reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0));
>>>> +
>>>> +                       if (!(reg & DWC3_GEVNTSIZ_INTMASK))
>>>> +                               evt->flags &= ~DWC3_EVENT_PENDING;
>>>> +               }
>>>>                    return IRQ_HANDLED;
>>>> +       }
>>>>
>>>>
>>> I don't want the bottom-half to be scheduled in the beginning as it may
>>> come before the cleanup in dwc3_event_buffers_setup().
>> You mean the above changes for clearing DWC3_EVENT_PENDINGnot required
>> as you given new change will prevent setting of DWC3_EVENT_PENDING
>> before dwc3_event_buffers_setup().
> Yes.
>
>> But I dont see any harm in above code changes for clearing
>> DWC3_EVENT_PENDING if it already set with evt->count=0.
> You can add it there, but do we need to if we can solve the actual
> issue?
Agree. Yes, we need to find actual issue.
>
> I'm interested in confirming my suspiction of what's really causing the
> DWC3_EVENT_PENDING to be set here. The code logic would be clearer
> rather than masking the behavior by depending on the reset by the
> dwc3_event_buffers_setup(). The runtime resume doesn't share the same
> locking mechanism as when processing an event. While it may be unlikely,
> I don't want the bottom-half thread to handle stale event or race with
> the runtime resume.

Hi Thinh,

I ran the code you recommended on our testing environment and was able 
to reproduce the issue one time.

When evt->flags contains DWC3_EVENT_PENDING, I've included the following 
debugging information: I added this debug message at the start of 
dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in 
during suspend and resume.

The results were quite interesting . I'm curious to understand how 
evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is 
equal to 0x1000 during the suspend.
Its means that the previous bottom-half handler prior to suspend might 
still be executing in the middle of the process.

Could you please give your suggestions here? And let me know if anything 
want to test or additional details are required.


##DBG: dwc3_event_buffers_cleanup:
  evt->length    :0x1000
  evt->lpos      :0x20c
  evt->count     :0x0
  evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408): 
0x00001000. Its means that previous bottom-half handler may be still 
running in middle

  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
  DWC3_DCFG(0xc700)              : 0x00e008a8
  DWC3_DCTL(0xc704)              : 0x0cf00a00
  DWC3_DEVTEN(0xc708)            : 0x00000000
  DWC3_DSTS(0xc70c)              : 0x00d20cd1


##DBG: dwc3_event_buffers_setup:
  evt->length    :0x1000
  evt->lpos      :0x20c
  evt->count     :0x0
  evt->flags     :0x1 // Still It's not clearing in during resume.

  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
  DWC3_DCFG(0xc700)              : 0x00080800
  DWC3_DCTL(0xc704)              : 0x00f00000
  DWC3_DEVTEN(0xc708)            : 0x00000000
  DWC3_DSTS(0xc70c)              : 0x00d20001

>
>> Anyway I will try the your new proposed changes alone on our testing
>> setup and will update the status,
>>
> Thanks,
> Thinh




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux