* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> On Mon, 25 Oct 2021 17:34:01 +0100
> "Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> wrote:
>
> > * Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> > > [Cc +dgilbert, +cohuck]
> > >
> > > On Wed, 20 Oct 2021 11:28:04 +0300
> > > Yishai Hadas <yishaih@xxxxxxxxxx> wrote:
> > >
> > > > On 10/20/2021 2:04 AM, Jason Gunthorpe wrote:
> > > > > On Tue, Oct 19, 2021 at 02:58:56PM -0600, Alex Williamson wrote:
> > > > >> I think that gives us this table:
> > > > >>
> > > > >> |   NDMA   | RESUMING |  SAVING  |  RUNNING |
> > > > >> +----------+----------+----------+----------+ ---
> > > > >> |     X    |     0    |     0    |     0    |  ^
> > > > >> +----------+----------+----------+----------+  |
> > > > >> |     0    |     0    |     0    |     1    |  |
> > > > >> +----------+----------+----------+----------+  |
> > > > >> |     X    |     0    |     1    |     0    |
> > > > >> +----------+----------+----------+----------+  NDMA value is either compatible
> > > > >> |     0    |     0    |     1    |     1    |  to existing behavior or don't
> > > > >> +----------+----------+----------+----------+  care due to redundancy vs
> > > > >> |     X    |     1    |     0    |     0    |  !_RUNNING/INVALID/ERROR
> > > > >> +----------+----------+----------+----------+
> > > > >> |     X    |     1    |     0    |     1    |  |
> > > > >> +----------+----------+----------+----------+  |
> > > > >> |     X    |     1    |     1    |     0    |  |
> > > > >> +----------+----------+----------+----------+  |
> > > > >> |     X    |     1    |     1    |     1    |  v
> > > > >> +----------+----------+----------+----------+ ---
> > > > >> |     1    |     0    |     0    |     1    |  ^
> > > > >> +----------+----------+----------+----------+  Desired new useful cases
> > > > >> |     1    |     0    |     1    |     1    |  v
> > > > >> +----------+----------+----------+----------+ ---
> > > > >>
> > > > >> Specifically, rows 1, 3, 5 with NDMA = 1 are valid states a user can
> > > > >> set which are simply redundant to the NDMA = 0 cases.
> > > > > It seems right
> > > > >
> > > > >> Row 6 remains invalid due to lack of support for pre-copy (_RESUMING
> > > > >> | _RUNNING) and therefore cannot be set by userspace.
> > > > >> Rows 7 & 8 are error states and cannot be set by userspace.
> > > > > I wonder, did Yishai's series capture this row 6 restriction? Yishai?
> > > >
> > > >
> > > > It seems so, by using the below check which includes the
> > > > !VFIO_DEVICE_STATE_VALID clause.
> > > >
> > > >          if (old_state == VFIO_DEVICE_STATE_ERROR ||
> > > >              !VFIO_DEVICE_STATE_VALID(state) ||
> > > >              (state & ~MLX5VF_SUPPORTED_DEVICE_STATES))
> > > >                  return -EINVAL;
> > > >
> > > > Which is:
> > > >
> > > > #define VFIO_DEVICE_STATE_VALID(state) \
> > > >      (state & VFIO_DEVICE_STATE_RESUMING ? \
> > > >      (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> > > >
> > > > >
> > > > >> Like other bits, setting the bit should be effective at the completion
> > > > >> of writing device state.  Therefore the device would need to flush any
> > > > >> outbound DMA queues before returning.
> > > > > Yes, the device commands are expected to achieve this.
> > > > >
> > > > >> The question I was really trying to get to though is whether we have a
> > > > >> supportable interface without such an extension.  There's currently
> > > > >> only an experimental version of vfio migration support for PCI devices
> > > > >> in QEMU (afaik),
> > > > > If I recall this only matters if you have a VM that is causing
> > > > > migratable devices to interact with each other. So long as the devices
> > > > > are only interacting with the CPU this extra step is not strictly
> > > > > needed.
> > > > >
> > > > > So, single device cases can be fine as-is
> > > > >
> > > > > IMHO the multi-device case the VMM should probably demand this support
> > > > > from the migration drivers, otherwise it cannot know if it is safe for
> > > > > sure.
> > > > >
> > > > > A config option to override the block if the admin knows there is no
> > > > > use case to cause devices to interact - eg two NVMe devices without
> > > > > CMB do not have a useful interaction.
> > > > >
> > > > >> so it seems like we could make use of the bus-master bit to fill
> > > > >> this gap in QEMU currently, before we claim non-experimental
> > > > >> support, but this new device agnostic extension would be required
> > > > >> for non-PCI device support (and PCI support should adopt it as
> > > > >> available).  Does that sound right?  Thanks,
> > > > > I don't think the bus master support is really a substitute, tripping
> > > > > bus master will stop DMA but it will not do so in a clean way and is
> > > > > likely to be non-transparent to the VM's driver.
> > > > >
> > > > > The single-device-assigned case is a cleaner restriction, IMHO.
> > > > >
> > > > > Alternatively we can add the 4th bit and insist that migration drivers
> > > > > support all the states. I'm just unsure what other HW can do, I get
> > > > > the feeling people have been designing to the migration description in
> > > > > the header file for a while and this is a new idea.
> > >
> > > I'm wondering if we're imposing extra requirements on the !_RUNNING
> > > state that don't need to be there.  For example, if we can assume that
> > > all devices within a userspace context are !_RUNNING before any of the
> > > devices begin to retrieve final state, then clearing of the _RUNNING
> > > bit becomes the device quiesce point and the beginning of reading
> > > device data is the point at which the device state is frozen and
> > > serialized.  No new states required and essentially works with a slight
> > > rearrangement of the callbacks in this series.  Why can't we do that?
> >
> > So without me actually understanding your bit encodings that closely, I
> > think the problem is we have to assume that any transition takes time.
> > From the QEMU point of view I think the requirement is when we stop the
> > machine (vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) in
> > migration_completion) that at the point that call returns (with no
> > error) all devices are idle.
> > That means you need a way to command the device to go into the stopped
> > state, and probably another to make sure it's got there.
>
> In a way.  We're essentially recognizing that we cannot stop a single
> device in isolation of others that might participate in peer-to-peer
> DMA with that device, so we need to make a pass to quiesce each device
> before we can ask the device to fully stop.  This new device state bit
> is meant to be that quiescent point, devices can accept incoming DMA
> but should cease to generate any.  Once all devices are quiesced then we
> can safely stop them.

It may need some further refinement; for example, in that quiesced state
do counters still tick? Will a NIC still respond to packets that don't
get forwarded to the host?

Note I still think you need a way to know when you have actually reached
these states; setting a bit in a register is asking nicely for a device
to go into a state - has it got there?

> > Now, you could be a *little* more sloppy; you could allow a device to carry
> > on doing stuff purely with its own internal state up until the point
> > it needs to serialise; but that would have to be strictly internal state
> > only - if it can change any other device's state (or issue an interrupt,
> > change RAM etc) then you get into ordering issues on the serialisation
> > of multiple devices.
>
> Yep, that's the proposal that doesn't require a uAPI change, we loosen
> the definition of stopped to mean the device can no longer generate DMA
> or interrupts and all internal processing outside of responding to
> incoming DMA should halt (essentially the same as the new quiescent
> state above).  Once all devices are in this state, there should be no
> incoming DMA and we can safely collect per device migration data.  If
> state changes occur beyond the point in time where userspace has
> initiated the collection of migration data, drivers have options for
> generating errors when userspace consumes that data.
How do you know that last device has actually gone into that state?
Also be careful; it feels much more delicate where something might
accidentally start a transaction.

> AFAICT, the two approaches are equally valid.  If we modify the uAPI to
> include this new quiescent state then userspace needs to make some hard
> choices about what configurations they support without such a feature.
> The majority of configurations are likely not exercising p2p between
> assigned devices, but the hypervisor can't know that.  If we work
> within the existing uAPI, well there aren't any open source driver
> implementations yet anyway and any non-upstream implementations would
> need to be updated for this clarification.  Existing userspace works
> better with no change, so long as they already follow the guideline
> that all devices in the userspace context must be stopped before the
> migration data of any device can be considered valid.  Thanks,

Dave

> Alex
>
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
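
The quiesce-then-stop ordering discussed in this thread, together with the
"has it got there?" concern, can be sketched as a two-pass loop that confirms
each state on every device before moving on. This is only a simulation under
stated assumptions, not VFIO or QEMU code: struct sim_dev, dev_request(),
dev_poll(), all_reach() and stop_all() are hypothetical names, the RUNNING,
SAVING and RESUMING bit positions simply follow the column order of the table
quoted above, and NDMA as bit 3 is an assumption taken from the "4th bit"
remark rather than a settled uAPI value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout, following the table's column order. */
#define ST_RUNNING  (1u << 0)
#define ST_SAVING   (1u << 1)
#define ST_RESUMING (1u << 2)
#define ST_NDMA     (1u << 3)   /* hypothetical "4th bit" */

/* Hypothetical device model: a request only takes effect after `delay` polls. */
struct sim_dev {
    uint32_t requested;  /* last state written by userspace */
    uint32_t actual;     /* state the device has really reached */
    unsigned delay;      /* polls a transition takes on this device */
    unsigned pending;    /* polls left until `actual` catches up */
};

/* "Asking nicely": record the request; the device gets there later. */
static void dev_request(struct sim_dev *d, uint32_t state)
{
    d->requested = state;
    d->pending = d->delay;
    if (d->pending == 0)
        d->actual = state;
}

/* One poll: advance the simulated device, report whether it has arrived. */
static bool dev_poll(struct sim_dev *d)
{
    if (d->pending > 0 && --d->pending == 0)
        d->actual = d->requested;
    return d->actual == d->requested;
}

/*
 * Ask every device to set/clear the given bits, then wait until each one
 * confirms. Returns false if a device has not confirmed after max_polls
 * failed polls.
 */
static bool all_reach(struct sim_dev *devs, size_t n,
                      uint32_t set, uint32_t clear, unsigned max_polls)
{
    assert(devs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        dev_request(&devs[i], (devs[i].actual | set) & ~clear);

    for (size_t i = 0; i < n; i++) {
        unsigned polls = 0;
        while (!dev_poll(&devs[i]))
            if (++polls >= max_polls)
                return false;
    }
    return true;
}

/* Two passes: quiesce every device first, only then stop every device. */
static bool stop_all(struct sim_dev *devs, size_t n, unsigned max_polls)
{
    /* Pass 1: no device initiates DMA any more, but all still accept it. */
    if (!all_reach(devs, n, ST_NDMA, 0, max_polls))
        return false;
    /* Pass 2: with no DMA in flight, clearing _RUNNING is safe. */
    return all_reach(devs, n, 0, ST_RUNNING, max_polls);
}
```

A caller would fill an array of sim_dev with actual = ST_RUNNING and a
per-device delay, then call stop_all(); a false return means at least one
device never confirmed the requested state, so collecting migration data
from any device would not be safe yet. The same structure would apply to
the no-uAPI-change variant by dropping pass 1 and treating the cleared
_RUNNING bit as the quiesce point.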