Re: gadgetfs inode.c - possible memory corruption ?

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 7 Jul 2022 13:16:15 -0400

On Thu, Jul 07, 2022 at 06:47:09PM +0200, Jozo M. wrote:
> But can't be this a deadlock ?
> What if IRQ which calls complete(...) will be executed sooner than
> wait_for_completion()

It's okay.  If complete() gets called first then wait_for_completion() 
will return immediately.

Alan Stern

> 
> št 7. 7. 2022 o 18:11 Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> napísal(a):
> >
> > On Thu, Jul 07, 2022 at 11:43:09AM +0200, Jozo M. wrote:
> > > Hello,
> > >
> > > my kernel running on imx6 was crashing on USB gadgetfs because my
> > > kernel was using wait_event API instead of completion (I was convinced
> > > it is due to wrong HW setup).
> > > During research gadgetfs inode.c function ep_io was not clear for me:
> > >
> > > we are submiting USB request here
> > >       value = usb_ep_queue (epdata->ep, req, GFP_ATOMIC);
> > > then we are waiting for completion here:
> > >       value = wait_for_completion_interruptible(&done);
> > > but if completion is interrupted we end up here:
> > >       spin_unlock_irq (&epdata->dev->lock);
> > >       DBG (epdata->dev, "endpoint gone\n");
> > >       epdata->status = -ENODEV;
> >
> > You left out part of the code.  We end up at this code in the case where
> > epdata->ep == NULL, and the only way that can happen is if the endpoint
> > was removed while we were waiting for the completion.
> >
> > > At this point ep_io is terminated and stack is not valid. Later on
> > > epio_complete might be called from IRQ and it calls complete ((struct
> > > completion *)req->context) but stack is no longer valid;
> > > Shouldn't we put req->context = NULL;  before spin_unlock_irq
> > > (&epdata->dev->lock); ?
> > >       req->context = NULL;
> > >       spin_unlock_irq (&epdata->dev->lock);
> > >       DBG (epdata->dev, "endpoint gone\n");
> > >       epdata->status = -ENODEV;
> >
> > You're right that this is a bug, but your solution is not correct.
> > There is a race: epio_complete can run at the same time as this code if
> > the endpoint is removed concurrently with the interrupt, and your
> > approach is still subject to this race.
> >
> > The right way to fix the problem is to call wait_for_completion(&done)
> > between the DBG and the assignment to epdata->status.  That way the
> > stack will still be valid when epio_complete runs.
> >
> > Please feel free to submit a patch doing this.
> >
> > Alan Stern