Re: [PATCH V2 mlx5-next 14/14] vfio/mlx5: Use its own PCI reset_done error handler

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 20 Oct 2021 13:46:29 -0300

On Wed, Oct 20, 2021 at 11:46:07AM +0300, Yishai Hadas wrote:

> What is the expectation for a reasonable delay ? we may expect this system
> WQ to run only short tasks and be very responsive.

If the expectation is that qemu will see the error return and then
turn around and issue FLR followed by another state operation, then it
does seem strange that there would be a delay.

On the other hand, this doesn't seem that useful. If qemu tries to
migrate and the device fails then the migration operation is toast and
possibly the device is wrecked. It can't really issue a FLR without
coordinating with the VM, and it cannot resume the VM as the device is
now irrecoverably messed up.

If we look at this from a RAS perspective, what would be useful here
is a way for qemu to request failsafe migration data. This must
always be available and cannot fail.

When the failsafe is loaded into the device it would trigger the
device's built-in RAS features to co-ordinate with the VM driver and
recover. Perhaps qemu would also have to inject an AER or something.

Basically instead of the device starting in an "empty ready to use
state" it would start in a "failure detected, needs recovery" state.

Not hitless, but preserves overall availability vs a failed migration
== VM crash.

That said, it is just a thought, and I don't know if anyone has put
any resources into what to do if migration operations fail right now.

But failure is possible, ie the physical device could have crashed and
perhaps the migration is to move the VMs off the broken HW. In this
scenario all the migration operations will time out and fail in the
driver.

However, since the guest VM could issue a FLR at any time, we really
shouldn't have this kind of operation floating around in the
background. Things must be made deterministic for qemu.

eg if qemu gets a guest request for FLR during the pre-copy stage it
really should abort the pre-copy, issue the FLR and then restart the
migration. I think it is unreasonable to ask a device to be able to
maintain pre-copy across FLR.
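
As a rough sketch of that ordering (this is not qemu's actual code;
every name below is hypothetical and the device is reduced to a couple
of counters), the FLR handling on the qemu side could look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical migration phases as seen by the VMM; not a real qemu or VFIO enum. */
enum mig_phase { MIG_NONE, MIG_PRE_COPY };

/* Hypothetical per-device bookkeeping, for illustration only. */
struct vdev {
    enum mig_phase phase;
    unsigned int flr_count;        /* FLRs issued to the device */
    unsigned int precopy_restarts; /* times pre-copy was started over */
};

/* Stand-ins for the real operations; they only update the model. */
static void abort_precopy(struct vdev *d) { d->phase = MIG_NONE; }
static void issue_flr(struct vdev *d)     { d->flr_count++; }
static void start_precopy(struct vdev *d) { d->phase = MIG_PRE_COPY; }

/*
 * Guest requested an FLR. If pre-copy is running, abort it first, then
 * do the FLR, then start pre-copy again from scratch. The device is
 * never asked to carry pre-copy state across the reset.
 */
static void handle_guest_flr(struct vdev *d)
{
    bool was_precopy;

    assert(d != NULL);
    was_precopy = (d->phase == MIG_PRE_COPY);

    if (was_precopy)
        abort_precopy(d);

    issue_flr(d);

    if (was_precopy) {
        start_precopy(d);
        d->precopy_restarts++;
    }
}
```

The point of the sketch is only the ordering: abort, reset, restart,
all driven synchronously from the one place that saw the FLR request.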

To make this work the restarting of the migration must not race with a
scheduled work wiping out all the state.
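
One way to get that determinism without background work, sketched in
userspace C with a pthread mutex standing in for the kernel locking.
This is an assumption about how it could be structured, not what the
patch does; struct mig_dev and all the function names are made up for
illustration. The reset is either applied before reset_done() returns,
or it is recorded and applied before the in-flight state operation
returns. Nothing is left floating around afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical device model; field names are illustrative only. */
struct mig_dev {
    pthread_mutex_t lock;        /* protects every field below */
    bool state_op_busy;          /* a migration state op is in flight */
    bool deferred_reset;         /* a reset arrived while busy */
    int mig_state;               /* 0 means clean, post-reset state */
    unsigned int resets_applied; /* how many times state was wiped */
};

#define MIG_DEV_INIT { PTHREAD_MUTEX_INITIALIZER, false, false, 0, 0 }

/* Wipe migration state. Caller must hold d->lock. */
static void apply_reset_locked(struct mig_dev *d)
{
    d->mig_state = 0;
    d->resets_applied++;
}

/* Mark a state operation as started. */
static void state_op_begin(struct mig_dev *d)
{
    pthread_mutex_lock(&d->lock);
    /* Assumption: the caller never overlaps two state operations. */
    assert(!d->state_op_busy);
    d->state_op_busy = true;
    pthread_mutex_unlock(&d->lock);
}

/* Finish a state operation and apply any reset that arrived meanwhile. */
static void state_op_end(struct mig_dev *d, int new_state)
{
    pthread_mutex_lock(&d->lock);
    d->mig_state = new_state;
    d->state_op_busy = false;
    if (d->deferred_reset) {
        d->deferred_reset = false;
        apply_reset_locked(d); /* the reset wins over the op's result */
    }
    pthread_mutex_unlock(&d->lock);
}

/* Called when an FLR has completed on the device. */
static void reset_done(struct mig_dev *d)
{
    pthread_mutex_lock(&d->lock);
    if (d->state_op_busy)
        d->deferred_reset = true; /* applied by state_op_end() */
    else
        apply_reset_locked(d);    /* applied synchronously, right here */
    pthread_mutex_unlock(&d->lock);
}
```

With this shape, a migration restarted after the FLR cannot be wiped by
a stale reset: by the time the next state_op_begin() runs, any earlier
reset has already been applied.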

So, regrettably, something is needed here.

Ideally more of this logic would be in shared code, but I'm not sure I
have a good feeling for what that should look like at this point.
Something to attempt once there are a few more implementations.

For instance, the if predicate ladder I mentioned in the last email
should be shared code, not driver code, as it is fundamental ABI.

Jason