Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices

"Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> · Mon, 25 Oct 2021 17:34:01 +0100

* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> [Cc +dgilbert, +cohuck]
> 
> On Wed, 20 Oct 2021 11:28:04 +0300
> Yishai Hadas <yishaih@xxxxxxxxxx> wrote:
> 
> > On 10/20/2021 2:04 AM, Jason Gunthorpe wrote:
> > > On Tue, Oct 19, 2021 at 02:58:56PM -0600, Alex Williamson wrote:  
> > >> I think that gives us this table:
> > >>
> > >> |   NDMA   | RESUMING |  SAVING  |  RUNNING |
> > >> +----------+----------+----------+----------+ ---
> > >> |     X    |     0    |     0    |     0    |  ^
> > >> +----------+----------+----------+----------+  |
> > >> |     0    |     0    |     0    |     1    |  |
> > >> +----------+----------+----------+----------+  |
> > >> |     X    |     0    |     1    |     0    |
> > >> +----------+----------+----------+----------+  NDMA value is either compatible
> > >> |     0    |     0    |     1    |     1    |  to existing behavior or don't
> > >> +----------+----------+----------+----------+  care due to redundancy vs
> > >> |     X    |     1    |     0    |     0    |  !_RUNNING/INVALID/ERROR
> > >> +----------+----------+----------+----------+
> > >> |     X    |     1    |     0    |     1    |  |
> > >> +----------+----------+----------+----------+  |
> > >> |     X    |     1    |     1    |     0    |  |
> > >> +----------+----------+----------+----------+  |
> > >> |     X    |     1    |     1    |     1    |  v
> > >> +----------+----------+----------+----------+ ---
> > >> |     1    |     0    |     0    |     1    |  ^
> > >> +----------+----------+----------+----------+  Desired new useful cases
> > >> |     1    |     0    |     1    |     1    |  v
> > >> +----------+----------+----------+----------+ ---
> > >>
> > >> Specifically, rows 1, 3, 5 with NDMA = 1 are valid states a user can
> > >> set which are simply redundant to the NDMA = 0 cases.  
> > > It seems right
> > >  
> > >> Row 6 remains invalid due to lack of support for pre-copy (_RESUMING
> > >> | _RUNNING) and therefore cannot be set by userspace.  Rows 7 & 8
> > >> are error states and cannot be set by userspace.  
> > > I wonder, did Yishai's series capture this row 6 restriction? Yishai?  
> > 
> > 
> > It seems so,  by using the below check which includes the 
> > !VFIO_DEVICE_STATE_VALID clause.
> > 
> > if (old_state == VFIO_DEVICE_STATE_ERROR ||
> >          !VFIO_DEVICE_STATE_VALID(state) ||
> >          (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
> >          return -EINVAL;
> > 
> > Which is:
> > 
> > #define VFIO_DEVICE_STATE_VALID(state) \
> >      (state & VFIO_DEVICE_STATE_RESUMING ? \
> >      (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> > 
> > >  
> > >> Like other bits, setting the bit should be effective at the completion
> > >> of writing device state.  Therefore the device would need to flush any
> > >> outbound DMA queues before returning.  
> > > Yes, the device commands are expected to achieve this.
> > >  
> > >> The question I was really trying to get to though is whether we have a
> > >> supportable interface without such an extension.  There's currently
> > >> only an experimental version of vfio migration support for PCI devices
> > >> in QEMU (afaik),  
> > > If I recall this only matters if you have a VM that is causing
> > > migratable devices to interact with each other. So long as the devices
> > > are only interacting with the CPU this extra step is not strictly
> > > needed.
> > >
> > > So, single device cases can be fine as-is
> > >
> > > IMHO the multi-device case the VMM should probably demand this support
> > > from the migration drivers, otherwise it cannot know if it is safe for
> > > sure.
> > >
> > > A config option to override the block if the admin knows there is no
> > > use case to cause devices to interact - eg two NVMe devices without
> > > CMB do not have a useful interaction.
> > >  
> > >> so it seems like we could make use of the bus-master bit to fill
> > >> this gap in QEMU currently, before we claim non-experimental
> > >> support, but this new device agnostic extension would be required
> > >> for non-PCI device support (and PCI support should adopt it as
> > >> available).  Does that sound right?  Thanks,  
> > > I don't think the bus master support is really a substitute, tripping
> > > bus master will stop DMA but it will not do so in a clean way and is
> > > likely to be non-transparent to the VM's driver.
> > >
> > > The single-device-assigned case is a cleaner restriction, IMHO.
> > >
> > > Alternatively we can add the 4th bit and insist that migration drivers
> > > support all the states. I'm just unsure what other HW can do, I get
> > > the feeling people have been designing to the migration description in
> > > the header file for a while and this is a new idea.
> 
> I'm wondering if we're imposing extra requirements on the !_RUNNING
> state that don't need to be there.  For example, if we can assume that
> all devices within a userspace context are !_RUNNING before any of the
> devices begin to retrieve final state, then clearing of the _RUNNING
> bit becomes the device quiesce point and the beginning of reading
> device data is the point at which the device state is frozen and
> serialized.  No new states required and essentially works with a slight
> rearrangement of the callbacks in this series.  Why can't we do that?

So without me actually understanding your bit encodings that closely, I
think the problem is we have to assume that any transition takes time.
From the QEMU point of view I think the requirement is when we stop the
machine (vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) in
migration_completion) that at the point that call returns (with no
error) all devices are idle.  That means you need a way to command the
device to go into the stopped state, and probably another to make sure
it's got there.
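
As a rough sketch of that two-step shape (command every device to stop,
then confirm each one has actually got there before reporting success),
something like the following toy model is what I have in mind.  All of the
toy_* names are made up for illustration; none of this is QEMU or VFIO API,
and the "tick" is just a stand-in for time passing while a device drains.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; not a QEMU or VFIO structure. */
struct toy_dev {
    bool stop_requested;
    int  inflight;   /* outstanding work (DMA, interrupts) still draining */
};

/* Step 1: command the device to stop.  Returns immediately. */
static void toy_request_stop(struct toy_dev *d)
{
    d->stop_requested = true;
}

/* Stand-in for time passing: a stopping device retires one unit per tick. */
static void toy_tick(struct toy_dev *d)
{
    if (d->stop_requested && d->inflight > 0)
        d->inflight--;
}

/* Step 2: check whether the device has really reached the stopped state. */
static bool toy_is_idle(const struct toy_dev *d)
{
    return d->stop_requested && d->inflight == 0;
}

/* Returns 0 only once every device is idle, -1 if the tick budget runs out. */
static int toy_stop_all(struct toy_dev *devs, size_t n, int max_ticks)
{
    for (size_t i = 0; i < n; i++)
        toy_request_stop(&devs[i]);

    for (int t = 0; t <= max_ticks; t++) {
        bool all_idle = true;

        for (size_t i = 0; i < n; i++)
            if (!toy_is_idle(&devs[i]))
                all_idle = false;
        if (all_idle)
            return 0;

        for (size_t i = 0; i < n; i++)
            toy_tick(&devs[i]);
    }
    return -1;
}
```

The point is only that the "stop" call must not report success while any
device still has work in flight; how a real driver waits or polls is up to
the driver.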

Now, you could be a *little* more sloppy; you could allow a device to carry
on doing stuff purely with its own internal state up until the point
it needs to serialise; but that would have to be strictly internal state
only - if it can change any other device's state (or issue an interrupt,
change RAM etc) then you get into ordering issues on the serialisation
of multiple devices.
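
To illustrate the ordering issue, here is a deliberately tiny model with
two devices where one can write to the other (think peer-to-peer DMA).
The p2p_* names and the structure are hypothetical and only there to show
the hazard; they do not correspond to anything in the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device that may modify a peer's externally visible state. */
struct p2p_dev {
    bool running;
    int  state;             /* state that gets serialised */
    struct p2p_dev *peer;   /* device this one may write to, or NULL */
};

/* A running device pokes its peer: an externally visible side effect. */
static void p2p_step(struct p2p_dev *d)
{
    if (d->running && d->peer)
        d->peer->state++;
}

static int p2p_serialise(const struct p2p_dev *d)
{
    return d->state;
}

/* Stop and serialise one device at a time: b can still change a. */
static bool p2p_interleaved_is_consistent(struct p2p_dev *a, struct p2p_dev *b)
{
    a->running = false;
    int snap_a = p2p_serialise(a);

    p2p_step(b);            /* b is still running and writes to a */
    b->running = false;
    (void)p2p_serialise(b);

    return snap_a == a->state;
}

/* Stop everything first, then serialise: nothing can change underneath. */
static bool p2p_two_phase_is_consistent(struct p2p_dev *a, struct p2p_dev *b)
{
    a->running = false;
    b->running = false;
    int snap_a = p2p_serialise(a);

    p2p_step(b);            /* no effect: b is already stopped */
    (void)p2p_serialise(b);

    return snap_a == a->state;
}
```

In the interleaved case the snapshot of the first device is stale by the
time the second one stops, which is the sort of inconsistency I mean.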

Dave

> Maybe a clarification of the uAPI spec is sufficient to achieve this,
> ex. !_RUNNING devices may still update their internal state machine
> based on external access.  Userspace is expected to quiesce all external
> access prior to initiating the retrieval of the final device state from
> the data section of the migration region.  Failure to do so may result
> in inconsistent device state or optionally the device driver may induce
> a fault if a quiescent state is not maintained.
> 
> > Just to be sure,
> > 
> > We refer here to some future functionality support with this extra 4th 
> > bit but it doesn't enforce any change in the submitted code, right ?
> > 
> > The below code uses the (state & ~MLX5VF_SUPPORTED_DEVICE_STATES) clause 
> > which fails any usage of a non-supported bit as of this one.
> > 
> > if (old_state == VFIO_DEVICE_STATE_ERROR ||
> >          !VFIO_DEVICE_STATE_VALID(state) ||
> >          (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
> >          return -EINVAL;
> 
> Correct, userspace shouldn't be setting any extra bits unless we
> advertise support, such as via a capability or flag.  Drivers need to
> continue to sanitize user input to validate yet-to-be-defined bits are
> not accepted from userspace or else we risk not being able to define
> them later without breaking userspace.  Thanks,
> 
> Alex
> 
-- 
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK