Re: [PATCH v3] usb: gadget: u_ether: Use __netif_rx() in rx_callback()

Hubert Wiśniewski <hubert.wisniewski.25632@xxxxxxxxx> · Tue, 01 Oct 2024 16:06:57 +0200

On Fri, 2024-09-27 at 16:12 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-09-27 15:33:35 [+0200], Hubert Wiśniewski wrote:
> > On Thu, 2024-09-26 at 21:39 +0200, Hubert Wiśniewski wrote:
> > > I'm a bit at a loss here. The deadlock seems to be unrelated to netif_rx()
> > > (which is not being called in the interrupt context after all), yet
> > > replacing it with __netif_rx() fixes the lockup (though a warning is still
> > > generated, which suggests that the patch does not completely fix the
> > > issue).
> > 
> > Well, never mind. After some investigation, I think the problem is as
> > follows:
> > 
> > 1. musb_g_giveback() releases the musb lock using spin_unlock(). The lock
> > is now released, but hardirqs are still disabled.
> > 
> > 2. Then, usb_gadget_giveback_request() is called, which in turn calls
> > rx_complete(). This does not happen in the interrupt context, so netif_rx()
> > disables bottom halves, then enables them using local_bh_enable().
> > 
> > 3. This leads to calling __local_bh_enable_ip(), which gives off a warning
> > (the first backtrace) that hardirqs are disabled. Then, hardirqs are
> > disabled (again?), and then enabled (as they should have been in the first
> > place).
> > 
> > 4. After usb_gadget_giveback_request() returns, musb_g_giveback() acquires
> > the musb lock using spin_lock(). This does not disable hardirqs, so they
> > are still enabled.
> > 
> > 5. While the musb lock is acquired, an interrupt occurs. It is handled by
> > dsps_interrupt(), which acquires the musb lock. A deadlock occurs.
> 
> This all makes sense so far.

I have done more testing on this. It seems that this deadlock possibility
reported by lockdep is not the cause, but just a symptom.

For now, my conclusion is that the problem lies in the MUSB gadget driver
itself. Interrupts (in peripheral mode) on Rx endpoints are handled by
musb_g_rx(), which pulls requests from the EP request queue. If there is no
request queued, it just returns without clearing the RXPKTRDY flag in the
RXCSR register (but the interrupt flag in the glue layer register has
already been cleared by the glue layer IRQ handler). This makes the received
packet wait for the next interrupt. If the Rx FIFO is full, no more packets
are received and no more interrupts are generated. The EP stays locked up
forever (or until the RXPKTRDY flag is cleared manually :)).
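
To make the sequence above easier to follow, here is a toy user-space model
of a single Rx endpoint. It is only a sketch of the behaviour as I described
it, not driver code: struct toy_ep and all toy_*() functions are
hypothetical names, the FIFO is assumed to hold a single packet, and queuing
a request is assumed not to touch the FIFO by itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one Rx endpoint. All names are hypothetical. */
struct toy_ep {
	bool rxpktrdy;     /* models the RXPKTRDY flag in RXCSR */
	bool irq_pending;  /* models the glue layer interrupt flag */
	int queued_reqs;   /* requests waiting in the EP request queue */
	int delivered;     /* packets handed over to a request */
};

/* Hardware side: a packet arrives. Returns true if it was accepted. */
static bool toy_packet_arrives(struct toy_ep *ep)
{
	if (ep->rxpktrdy)
		return false;      /* FIFO full: no packet, no new interrupt */
	ep->rxpktrdy = true;
	ep->irq_pending = true;
	return true;
}

/* Driver side: the glue layer handler followed by the Rx handler. */
static void toy_handle_irq(struct toy_ep *ep)
{
	if (!ep->irq_pending)
		return;
	/* Model invariant: an Rx interrupt implies a packet in the FIFO. */
	assert(ep->rxpktrdy);
	ep->irq_pending = false;   /* glue layer clears its flag first */
	if (ep->queued_reqs == 0)
		return;            /* no request: RXPKTRDY stays set */
	ep->queued_reqs--;
	ep->delivered++;
	ep->rxpktrdy = false;      /* FIFO drained, EP can receive again */
}

/* True if the EP cannot make progress without outside help. */
static bool toy_is_stuck(const struct toy_ep *ep)
{
	return ep->rxpktrdy && !ep->irq_pending;
}
```

In this model, a packet that arrives while a request is queued is delivered
normally. A packet that arrives with an empty request queue leaves RXPKTRDY
set with no interrupt pending, so toy_is_stuck() returns true, and queuing a
request afterwards does not help because nothing triggers
toy_handle_irq() again.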