On Wed, 20 Oct 2021 13:46:29 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Oct 20, 2021 at 11:46:07AM +0300, Yishai Hadas wrote:
>
> > What is the expectation for a reasonable delay? We may expect this system
> > WQ to run only short tasks and be very responsive.
>
> If the expectation is that qemu will see the error return and then turn
> around and issue FLR followed by another state operation then it does
> seem strange that there would be a delay.
>
> On the other hand, this doesn't seem that useful. If qemu tries to
> migrate and the device fails then the migration operation is toast and
> possibly the device is wrecked. It can't really issue a FLR without
> coordinating with the VM, and it cannot resume the VM as the device is
> now irrecoverably messed up.
>
> If we look at this from a RAS perspective, what would be useful here
> is a way for qemu to request fail-safe migration data. This must
> always be available and cannot fail.
>
> When the failsafe is loaded into the device it would trigger the
> device's built-in RAS features to co-ordinate with the VM driver and
> recover. Perhaps qemu would also have to inject an AER or something.
>
> Basically instead of the device starting in an "empty ready to use
> state" it would start in a "failure detected, needs recovery" state.

The "fail-safe recovery state" is essentially the reset state of the
device. If a device enters an error state during migration, I would
think the ultimate recovery procedure would be to abort the migration
and send an AER to the VM, whereby the guest would trigger a reset, and
the RAS capabilities of the guest would handle failing over to a
multipath device, ejecting the failing device, etc.

However, regardless of the migration recovery strategy, userspace needs
a means to get the device back into an initial state in a deterministic
way without closing and re-opening the device (or polling for an
arbitrary length of time). That's the minimum viable product here.

Thanks,
Alex
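
For illustration only, the recovery flow described above can be modelled as a toy state machine. This is a minimal sketch in Python and not the VFIO interface: `ToyDevice`, `State`, `start_saving`, `reset` and `migrate_with_recovery` are all hypothetical names, and the reset is simply assumed to be synchronous, which is the property being asked for (no close/re-open, no polling).

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()  # initial, ready-to-use state
    SAVING = auto()   # migration data is being collected
    ERROR = auto()    # device failed during migration


class ToyDevice:
    """Hypothetical model of a migratable device; not the VFIO API."""

    def __init__(self):
        self.state = State.RUNNING

    def start_saving(self, fail=False):
        # A failure during migration leaves the device in ERROR.
        self.state = State.ERROR if fail else State.SAVING

    def reset(self):
        # Assumed synchronous: the device is back in its initial state
        # by the time this returns, without reopening or polling.
        self.state = State.RUNNING


def migrate_with_recovery(dev, fail=False):
    """Start a migration and return the recovery steps taken, if any."""
    steps = []
    dev.start_saving(fail=fail)
    if dev.state is State.ERROR:
        steps.append("abort migration")
        steps.append("inject AER into guest")
        # In the flow described above the guest would trigger this reset.
        dev.reset()
        steps.append("reset device")
    return steps
```

The point of the sketch is only that `reset()` is the single deterministic path back to the initial state, whatever recovery policy sits on top of it.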