Re: [PATCH 1/1] vhost: Protect the virtqueue from being cleared whilst still in use

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Tue, 8 Mar 2022 06:06:47 -0500

On Tue, Mar 08, 2022 at 08:08:25AM +0000, Lee Jones wrote:
> On Tue, 08 Mar 2022, Jason Wang wrote:
> 
> > On Tue, Mar 8, 2022 at 3:18 AM Lee Jones <lee.jones@xxxxxxxxxx> wrote:
> > >
> > > vhost_vsock_handle_tx_kick() already holds the mutex during its call
> > > to vhost_get_vq_desc().  All we have to do here is take the same lock
> > > during virtqueue clean-up and we mitigate the reported issues.
> > >
> > > Also WARN() as a precautionary measure.  The purpose of this is to
> > > capture possible future race conditions which may pop up over time.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=279432d30d825e63ba00
> > >
> > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > Reported-by: syzbot+adc3cb32385586bec859@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Signed-off-by: Lee Jones <lee.jones@xxxxxxxxxx>
> > > ---
> > >  drivers/vhost/vhost.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 59edb5a1ffe28..ef7e371e3e649 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -693,6 +693,15 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >         int i;
> > >
> > >         for (i = 0; i < dev->nvqs; ++i) {
> > > +               /* No workers should run here by design. However, races have
> > > +                * previously occurred where drivers have been unable to flush
> > > +                * all work properly prior to clean-up.  Without a successful
> > > +                * flush the guest will malfunction, but avoiding host memory
> > > +                * corruption in those cases does seem preferable.
> > > +                */
> > > +               WARN_ON(mutex_is_locked(&dev->vqs[i]->mutex));
> > > +
> > 
> > I don't get how this can help, the mutex could be grabbed in the
> > middle of the above and below line.
> 
> The worst that happens in this slim scenario is we miss a warning.
> The mutexes below will still function as expected and prevent possible
> memory corruption.

Maybe. Or maybe corruption has already happened and this is the
fallout.
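
To make the window being discussed concrete, here is a minimal
userspace sketch using pthreads. It is not kernel code: struct demo_vq,
demo_is_locked(), demo_worker(), demo_cleanup() and demo_run() are all
hypothetical names invented for illustration, and the trylock probe is
only a rough analogue of mutex_is_locked(). The sketch forces a worker
to take the lock after the advisory check but before clean-up locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a virtqueue; not the kernel's struct. */
struct demo_vq {
	pthread_mutex_t mutex;     /* plays the role of vq->mutex */
	int *ctx;                  /* plays the role of error_ctx and friends */
	pthread_mutex_t sync_lock; /* scaffolding to force the interleaving */
	pthread_cond_t sync_cond;
	bool worker_has_lock;
	bool worker_saw_ctx;
};

/* Rough userspace analogue of mutex_is_locked(): probe with trylock. */
static bool demo_is_locked(struct demo_vq *vq)
{
	if (pthread_mutex_trylock(&vq->mutex) != 0)
		return true;
	pthread_mutex_unlock(&vq->mutex);
	return false;
}

static void *demo_worker(void *arg)
{
	struct demo_vq *vq = arg;

	pthread_mutex_lock(&vq->mutex);

	/* Tell the clean-up thread that the lock is now held. */
	pthread_mutex_lock(&vq->sync_lock);
	vq->worker_has_lock = true;
	pthread_cond_signal(&vq->sync_cond);
	pthread_mutex_unlock(&vq->sync_lock);

	/* Read ctx under the lock, as a kick handler would. */
	vq->worker_saw_ctx = (vq->ctx != NULL);
	pthread_mutex_unlock(&vq->mutex);
	return NULL;
}

/* Returns true if the advisory check would have warned. */
static bool demo_cleanup(struct demo_vq *vq)
{
	pthread_t tid;
	bool warned = demo_is_locked(vq);   /* advisory, like WARN_ON() */
	int rc;

	/* Force a worker into the window between the check and the lock. */
	rc = pthread_create(&tid, NULL, demo_worker, vq);
	assert(rc == 0);
	(void)rc;
	pthread_mutex_lock(&vq->sync_lock);
	while (!vq->worker_has_lock)
		pthread_cond_wait(&vq->sync_cond, &vq->sync_lock);
	pthread_mutex_unlock(&vq->sync_lock);

	pthread_mutex_lock(&vq->mutex);     /* waits for the worker to finish */
	vq->ctx = NULL;                     /* stands in for vhost_vq_reset() */
	pthread_mutex_unlock(&vq->mutex);

	pthread_join(tid, NULL);
	return warned;
}

/* Runs the scenario once; returns true if the worker saw a live ctx. */
static bool demo_run(bool *warned)
{
	int payload = 42;
	struct demo_vq vq = { .ctx = &payload };

	pthread_mutex_init(&vq.mutex, NULL);
	pthread_mutex_init(&vq.sync_lock, NULL);
	pthread_cond_init(&vq.sync_cond, NULL);
	*warned = demo_cleanup(&vq);
	return vq.worker_saw_ctx;
}
```

In this forced interleaving the check reports nothing, yet the worker
still sees a non-NULL ctx because clean-up has to wait for the lock.
That is the "missed warning" case only; the sketch says nothing about
corruption that happened before clean-up started taking the lock.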

> > > +               mutex_lock(&dev->vqs[i]->mutex);
> > >                 if (dev->vqs[i]->error_ctx)
> > >                         eventfd_ctx_put(dev->vqs[i]->error_ctx);
> > >                 if (dev->vqs[i]->kick)
> > > @@ -700,6 +709,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >                 if (dev->vqs[i]->call_ctx.ctx)
> > >                         eventfd_ctx_put(dev->vqs[i]->call_ctx.ctx);
> > >                 vhost_vq_reset(dev, dev->vqs[i]);
> > > +               mutex_unlock(&dev->vqs[i]->mutex);
> > >         }
> > 
> > I'm not sure it's correct to assume some behaviour of a buggy device.
> > For the device mutex, we use that to protect more than just err/call
> > and vq.
> 
> When I authored this, I did so as *the* fix.  However, since the cause
> of today's crash has now been patched, this has become a belt and
> braces solution.  Michael's addition of the WARN() also has the
> benefit of providing us with an early warning system for future
> breakages.  Personally, I think it's kinda neat.
> 
> -- 
> Lee Jones [李琼斯]
> Principal Technical Lead - Developer Services
> Linaro.org │ Open source software for Arm SoCs