Re: [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 10 Feb 2022 13:27:41 -0400

On Thu, Feb 10, 2022 at 09:48:11AM -0700, Alex Williamson wrote:

> Specifically, I suspect we can trigger this race if the VM reboots as
> we're initiating a migration in the STOP_COPY phase, but that's maybe
> less interesting if we expect the VM to be halted before the device
> state is stepped.  

Yes, STOP_COPY drivers like mlx5/acc are fine here inherently.

We have already restricted what device touches are allowed in
STOP_COPY, and this must include reset too. None of the two drivers
posted can tolerate a reset during the serialization step. 

mlx5 will fail the STOP_COPY FW command and I guess acc will 'tear'
its register reads and produce a corrupted state.

> More interesting might be how a PRE_COPY transition works relative
> to asynchronous VM resets triggering device resets.  Are we
> serializing all access to reset vs this DEVICE_FEATURE op or are we
> resorting to double checking the device state, and how do we plan to
> re-initiate migration states if a VM reset occurs during migration?
> Thanks,

The device will be in PRE_COPY with VCPUs running. An async reset will
be triggered in the guest, so the device returns to RUNNING and the
data_fd's immediately return an errno.

There are three ways qemu can observe this:

 1) it is actively using the data_fds, so it immediately gets an
    error and propogates it up, aborting the migration
    eg it is doing read(), poll(), iouring, etc.

 2) it is done with the PRE_COPY phase of the data_fd and is moving
    toward STOP_COPY.
    In this case the vCPU is halted and the SET_STATE to STOP_COPY
    will execute, without any race, either:
      PRE_COPY -> STOP_COPY (data_fd == -1)
      RUNNING -> STOP_COPY (data_fd != -1)
    The expected data_fd is detected in the WIP qemu patch, however it
    mishandles the error, we will fix it.

 3) it is aborting the PRE_COPY migration, closing the data_fd and
    doing SET_STATE to RUNNING. In which case it doesn't know the
    device was reset. close() succeeds and SET_STATE RUNNING -> RUNNING
    is a nop.

Today's qemu must abort the migration at this point and fully restart
it because it has no mechanism to serialize a 'discard all of this
device's PRE_COPY state up to here' tag.

Some future qemu could learn to do this and then the receiver would
discard already sent device state - by triggering reset and a new
RUNNING -> RESUMING on the receiving device. In this case qemu would
have a choice of:
  abort the entire migration
  restart just this device back to PRE_COPY
  stop the vCPUs and use STOP_COPY

In any case, qemu fully detects this race as a natural part of its
operations and knows with certainty when it commands to go to
STOP_COPY, with vCPUs halted, if the preceeding PRE_COPY state is
correct or not.

It is interesting you bring this up, I'm not sure this worked properly
with v1. It seems we have solved it, inadvertently even, by using the
basic logic of the FSM and FD.

Jason