On Fri, Feb 1, 2019 at 4:18 PM John Stultz <john.stultz@xxxxxxxxxx> wrote:
>
> Hey all,
>   Since the 5.0 merge window opened, I've been tripping on frequent
> dwc3 crashes on reboot and suspend, which I've added an example to the
> bottom of this mail.
>
> I've dug in a little bit and sort of have a sense of what's going on.
>
> In ffs_epfile_io():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/gadget/function/f_fs.c#n1065
>
> The completion done is set up on the stack:
>     DECLARE_COMPLETION_ONSTACK(done);
>
> Then later we set up a request and queue it:
>     req->context = &done;
>     ...
>     ret = usb_ep_queue(ep->ep, req, GFP_ATOMIC);
>
> Then wait for it:
>     if (unlikely(wait_for_completion_interruptible(&done))) {
>         /*
>          * To avoid race condition with ffs_epfile_io_complete,
>          * dequeue the request first then check
>          * status. usb_ep_dequeue API should guarantee no race
>          * condition with req->complete callback.
>          */
>         usb_ep_dequeue(ep->ep, req);
>         interrupted = ep->status < 0;
>     }
>
> The problem is that we end up being interrupted, supposedly dequeue
> the request, and exit.
>
> But then (or in parallel) the irq triggers and we try calling
> complete() on the context pointer, which now points to random stack
> space, which results in the panic.
>
> It seems like something is wrong with usb_ep_dequeue not really
> stopping the irq from happening?
>
> If I revert all the changes to dwc3 back to 4.20, I don't see the issue.
>
> I'll do some bisection to try to narrow things down, but I wanted to
> see if this was a known issue or if anyone had immediate ideas as to
> what might be wrong.

Bisecting the changes down, it seems like it's due to commit
fec9095bdef4e ("usb: dwc3: gadget: remove wait_end_transfer").

It doesn't happen all the time, so I'll need to run some more testing,
but so far I've not been able to trigger it after backing out the
patches to that point.

thanks
-john
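
For illustration only, here is a small userspace model of the ordering
described above. It is a sketch, not f_fs or dwc3 code: every name in it
(fake_waiter, fake_request, fake_dequeue_async, fake_dequeue_sync,
fake_irq) is hypothetical, and it assumes the failure is simply that the
dequeue returns while the controller still owns the request. The
frame_alive flag stands in for "the stack frame holding `done` still
exists", so the model can report the stale access without actually
performing one.

```c
#include <stdbool.h>

/* Stands in for the on-stack completion in ffs_epfile_io(). */
struct fake_waiter {
	bool frame_alive;	/* false once the waiting function has returned */
	bool completed;		/* set by a completion that arrives in time */
};

/* Stands in for struct usb_request; only the fields the model needs. */
struct fake_request {
	struct fake_waiter *context;	/* like req->context = &done */
	bool in_flight;			/* controller still owns the request */
};

/* Assumed-broken dequeue: asks the hardware to stop and returns at once. */
void fake_dequeue_async(struct fake_request *req)
{
	(void)req;	/* in_flight stays true: completion is still pending */
}

/* Dequeue that only returns once the request has been given back. */
void fake_dequeue_sync(struct fake_request *req)
{
	req->in_flight = false;
}

/*
 * Models the interrupt handler giving the request back.
 * Returns true if it would have called complete() on a waiter
 * whose stack frame is already gone.
 */
bool fake_irq(struct fake_request *req)
{
	if (!req->in_flight)
		return false;		/* nothing left to complete */
	req->in_flight = false;
	if (!req->context->frame_alive)
		return true;		/* complete() on stale stack space */
	req->context->completed = true;
	return false;
}

/*
 * Models the interrupted path: queue, get interrupted, dequeue, return,
 * and only then take the interrupt.
 */
bool interrupted_io_hits_stale_context(void (*dequeue)(struct fake_request *))
{
	struct fake_waiter waiter = { .frame_alive = true, .completed = false };
	struct fake_request req = { .context = &waiter, .in_flight = true };

	dequeue(&req);			/* wait was interrupted by a signal */
	waiter.frame_alive = false;	/* "return" from the waiting function */

	return fake_irq(&req);		/* the late interrupt */
}
```

Calling interrupted_io_hits_stale_context() with fake_dequeue_async
reports the stale access, while fake_dequeue_sync does not, which is the
guarantee the comment in ffs_epfile_io() expects from usb_ep_dequeue().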