On Fri, Feb 1, 2019 at 4:31 PM Thinh Nguyen <thinh.nguyen@xxxxxxxxxxxx> wrote:
>
> Hi John,
>
> John Stultz wrote:
> > Hey all,
> >   Since the 5.0 merge window opened, I've been tripping on frequent
> > dwc3 crashes on reboot and suspend, which I've added an example to the
> > bottom of this mail.
> >
> > I've dug in a little bit and sort of have a sense of whats going on.
> >
> > In ffs_epfile_io():
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_tree_drivers_usb_gadget_function_f-5Ffs.c-23n1065&d=DwIBaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=u9FYoxKtyhjrGFcyixFYqTjw1ZX0VsG2d8FCmzkTY-w&m=a8TU-itM8GBG_EARYf2yM-kVfCzmaPkKDNAUFQHTe3Q&s=BQiVAFiViSlxVg5_LemED0x_47FLVUD43M7R6h6T8qk&e=
> >
> > The completion done is setup on the stack:
> >   DECLARE_COMPLETION_ONSTACK(done);
> >
> > Then later we setup a request and queue it:
> >   req->context  = &done;
> >   ...
> >   ret = usb_ep_queue(ep->ep, req, GFP_ATOMIC);
> >
> > Then wait for it:
> >   if (unlikely(wait_for_completion_interruptible(&done))) {
> >       /*
> >        * To avoid race condition with ffs_epfile_io_complete,
> >        * dequeue the request first then check
> >        * status. usb_ep_dequeue API should guarantee no race
> >        * condition with req->complete callback.
> >        */
> >       usb_ep_dequeue(ep->ep, req);
> >       interrupted = ep->status < 0;
> >   }
> >
> > The problem is, that we end up being interrupted, supposedly dequeue
> > the request, and exit.
> >
> > But then (or in parallel) the irq triggers and we try calling
> > complete() on the context pointer which points to now random stack
> > space, which results in the panic.
> >
> > It seems like something is wrong with usb_ep_dequeue not really
> > stopping the irq from happening?
> >
> > If I revert all the changes to dwc3 back to 4.20, I don't see the issue.
> >
> > I'll do some bisection to try to narrow things down, but I wanted to
> > see if this was a known issue or if anyone had immediate ideas as to
> > what might be wrong.
> >
>
> I'm not sure if this is related, but can you try to test using Felipe's
> testing/next branch? There is a fix to a race condition when the gadget
> driver tries to dequeue requests.
>
> See if you run into this issue again.

I'll check that out! Thanks so much for the pointer!
-john
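The lifetime rule at the heart of the report can be modeled in plain
userspace C. The sketch below is only an illustration under stated
assumptions: every model_* name is hypothetical, none of it is dwc3 or
f_fs code, and it assumes the strict reading of the quoted comment, where
the completion callback has already run by the time dequeue returns and
cannot run again afterwards. Under that assumption a late interrupt finds
nothing to give back and never touches the dead stack frame.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion declared on the stack. */
struct model_completion {
    bool done;
};

/* Hypothetical stand-in for a queued request. */
struct model_request {
    void *context;                          /* points at the waiter's completion */
    void (*complete)(struct model_request *req);
    bool queued;                            /* true while the controller owns it */
    int status;
};

/* Completion callback: dereferences req->context, so it must still be valid. */
static void model_complete_cb(struct model_request *req)
{
    struct model_completion *c = req->context;

    c->done = true;
}

/* Give the request back at most once; later calls are no-ops. */
static void model_giveback(struct model_request *req, int status)
{
    if (!req->queued)
        return;
    req->queued = false;
    req->status = status;
    req->complete(req);
}

/*
 * Modeled dequeue: the callback runs before this returns.
 * -ECONNRESET as the cancellation status is an assumption of this model.
 */
static int model_dequeue(struct model_request *req)
{
    if (!req->queued)
        return -EINVAL;
    model_giveback(req, -ECONNRESET);
    return 0;
}

/* A late interrupt; returns 1 if it invoked the callback, 0 otherwise. */
int model_late_irq(struct model_request *req)
{
    bool was_queued = req->queued;

    model_giveback(req, 0);
    return was_queued ? 1 : 0;
}

/* Interrupted I/O path; returns 1 if the completion fired before return. */
int model_interrupted_io(struct model_request *req)
{
    struct model_completion done = { false };   /* lives in this frame only */

    req->context = &done;
    req->complete = model_complete_cb;
    req->queued = true;

    /* Pretend the wait was interrupted by a signal, then cancel. */
    model_dequeue(req);

    /* The callback must be finished before this frame goes away. */
    assert(done.done);
    req->context = NULL;                        /* do not leave a dangling pointer */
    return done.done ? 1 : 0;
}
```

In this model the crash described above corresponds to model_late_irq()
still seeing queued == true after model_interrupted_io() has returned,
which is exactly what the single giveback guard rules out.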