Re: [PATCH v4 0/2] vfio/mdev: Device namespace protection

Cornelia Huck <cohuck@xxxxxxxxxx> · Wed, 23 May 2018 15:34:03 +0200

On Wed, 23 May 2018 14:29:28 +0200
Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:

> On 05/23/2018 10:56 AM, Cornelia Huck wrote:
> > On Tue, 22 May 2018 12:38:29 -0600
> > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> >   
> >> On Tue, 22 May 2018 19:17:07 +0200
> >> Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:
> >>  
> >>>   From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
> >>> be fine with these changes. I'm however not too deeply involved with
> >>> the mdev framework, thus I don't feel comfortable r-b-ing. That results
> >>> in
> >>> Acked-by: Halil Pasic <pasic@xxxxxxxxxxxxx>
> >>> for both patches.
> >>>
> >>> While at it I would like to ask about the semantics and intended
> >>> use of the mdev interfaces.
> >>>
> >>> static int vfio_ccw_sch_probe(struct subchannel *sch)
> >>> {
> >>>
> >>> /* HALIL: 8< Not so interesting stuff happens here. >8 */  
> >>
> >> This was interesting:
> >>
> >> 	private->state = VFIO_CCW_STATE_NOT_OPER;
> >>  
> >>>           ret = vfio_ccw_mdev_reg(sch);
> >>>           if (ret)
> >>>                   goto out_disable;
> >>> /*
> >>>    * HALIL:
> >>>    * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
> >>>    * is made available (it calls mdev_register_device()). For instance create will
> >>>    * attempt to decrement private->avail which is initialized below. I fail to
> >>>    * understand how this is well synchronized.
> >>>    */
> >>>           INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
> >>>           atomic_set(&private->avail, 1);
> >>>           private->state = VFIO_CCW_STATE_STANDBY;
> >>>
> >>>           return 0;
> >>>
> >>> out_disable:
> >>>           cio_disable_subchannel(sch);
> >>> out_free:
> >>>           dev_set_drvdata(&sch->dev, NULL);
> >>>           kfree(private);
> >>>           return ret;
> >>> }
> >>>
> >>> Should the initialization not go before mdev_register_device(), and then be
> >>> rolled back if mdev_register_device() fails?
> >>>
> >>> In practice it does not seem very likely that userspace can trigger
> >>> mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
> >>> not be a practical problem. But I would like to understand how synchronization
> >>> is supposed to work.
> >>>
> >>> [Added Dong Jia, maybe he is also able to answer my question.]  
> >>
> >> vfio_ccw_mdev_create() requires that private->state is not
> >> VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
> >> to this value before calling vfio_ccw_mdev_reg(), so a create should
> >> return -ENODEV if racing with parent registration.  Is there something
> >> else that I'm missing?  Thanks,
> >>  
> 
> 
> Disclaimer: I did not do much kernel work up until now. I still have
> much to learn.
> 
> I mostly agree with your analysis but I'm not sure if the conclusion should be
> 'and thus everything is good' or 'and thus indeed we do have a race, a
> poorly handled one'.

Let me throw in that there is more than one way to handle a race, and
one of them is to return an error if something happens at an
inconvenient time :)
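
To illustrate what I mean, here is a rough userspace model of that
pattern. It is a sketch only, not the vfio-ccw source: the model_* names
are made up, the states are reduced to three, and it uses sequentially
consistent C11 atomics, so it says nothing about the reordering question
below. It only shows that a create arriving before probe has finished
gets an error instead of touching half-initialized data.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the probe/create interplay; not driver code. */
enum model_state { MODEL_NOT_OPER, MODEL_STANDBY, MODEL_IDLE };

struct model_private {
	atomic_int state;	/* one of enum model_state */
	atomic_int avail;	/* number of mdevs that may still be created */
};

/* Decrement *v only if the result stays >= 0; returns old value - 1. */
static int model_dec_if_positive(atomic_int *v)
{
	int old = atomic_load(v);

	while (old > 0 && !atomic_compare_exchange_weak(v, &old, old - 1))
		;	/* a failed exchange reloads 'old'; try again */
	return old - 1;
}

/* First half of probe: mark the device unusable before registering. */
static void model_probe_begin(struct model_private *p)
{
	assert(p != NULL);
	atomic_init(&p->avail, 0);
	atomic_init(&p->state, MODEL_NOT_OPER);
	/* registration would happen here; create is reachable from now on */
}

/* Second half of probe: finish setup, then open the gate. */
static void model_probe_finish(struct model_private *p)
{
	atomic_store(&p->avail, 1);
	atomic_store(&p->state, MODEL_STANDBY);
}

/* Create callback: fail early while the device is not operational. */
static int model_create(struct model_private *p)
{
	if (atomic_load(&p->state) == MODEL_NOT_OPER)
		return -ENODEV;
	if (model_dec_if_positive(&p->avail) < 0)
		return -EPERM;
	atomic_store(&p->state, MODEL_IDLE);
	return 0;
}
```

In this model a create issued between model_probe_begin() and
model_probe_finish() returns -ENODEV; the first create afterwards
succeeds, and a second one gets -EPERM because avail is used up.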

> 
> One thing I'm not sure about is whether atomic_set(&private->avail, 1) and
> private->state = VFIO_CCW_STATE_STANDBY can be perceived as reordered by
> e.g. some other CPU, and thus by vfio_ccw_mdev_create(). I tried to
> figure it out based on Documentation/atomic_t.txt but was not very successful.
> If these can be reordered we could observe -EPERM instead of -ENODEV, I
> think.

I don't think that matters (see below).

> 
> Furthermore from your analysis I deduce that the client code (I think mdev
> calls it vendor code) may rely on mdev_register_device() containing a
> (RELEASE) barrier. We use a mutex in there so the barrier is there. And
> the client code may rely on an (ACQUIRE) barrier before the create callback
> is called. That should also be true, and was true in the past too, again
> because of mutex usage.
> 
> 
> >> Alex  
> > 
> > No, I think your understanding is correct. We move the state from
> > NOT_OPER to STANDBY only after we're set up completely, so our create
> > callback will simply fail early with -ENODEV. This looks fine to me.
> >   
> 
> This -ENODEV looks strange to me. Which device does not exist? Is
> userspace supposed to retry on this? It's not even -EAGAIN. Is it
> documented somewhere?

-ENODEV looks very reasonable if we consider a device in the NOT_OPER
state.

> 
> If it's unavoidable (and I don't see why it would be) I would prefer -EAGAIN. I
> think throwing an -ENODEV at our userspace once in a blue moon (if ever)
> because that is the way we 'handle' races in our code instead of avoiding
> them is not very friendly.
> 
> And I'm not sure -EPERM is not possible (see my statement
> about reordering of the writes above).

I don't think the actual return code matters in this case. User
space must be prepared for an error (and -ENODEV was even possible
before, see the discussion in the v3 thread).

We're dealing with a hard-to-trigger corner case that is easily handled
by user space here: let's not overthink this.
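
For completeness, a sketch of what "handled by user space" could look
like. Assumptions on my side: the caller passes in the path of the
create attribute, and a short bounded retry on -ENODEV/-EAGAIN is
acceptable. Nothing here is mandated by the mdev interface, the function
names are made up, and simply reporting the error to the user would be
just as valid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write the uuid to the given attribute once; returns 0 or -errno. */
static int try_create_once(const char *create_path, const char *uuid)
{
	size_t len;
	ssize_t n;
	int fd, ret = 0;

	assert(create_path != NULL && uuid != NULL);

	fd = open(create_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = strlen(uuid);
	n = write(fd, uuid, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;	/* short write: treat as an I/O error */

	close(fd);
	return ret;
}

/*
 * Retry a few times if the parent looks not (yet) operational.
 * Any other result, success included, is returned immediately.
 */
static int create_with_retry(const char *create_path, const char *uuid,
			     int max_attempts)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int ret = -EINVAL;	/* returned if max_attempts < 1 */

	for (int i = 0; i < max_attempts; i++) {
		ret = try_create_once(create_path, uuid);
		if (ret != -ENODEV && ret != -EAGAIN)
			break;
		nanosleep(&delay, NULL);	/* 10 ms between attempts */
	}
	return ret;
}
```

A caller would do something like create_with_retry(path, uuid, 3) and
treat any negative result as a plain failure to create the device.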