On Fri, Aug 04, 2023 at 10:34:18AM -0700, Brett Creeley wrote:
>
> On 8/4/2023 10:18 AM, Jason Gunthorpe wrote:
> >
> > On Tue, Jul 25, 2023 at 02:40:24PM -0700, Brett Creeley wrote:
> > > It's possible that the device firmware crashes and is able to
> > > recover due to some configuration and/or other issue. If a live
> > > migration is in progress while the firmware crashes, the live
> > > migration will fail. However, the VF PCI device should still be
> > > functional post crash recovery, and subsequent migrations should
> > > go through as expected.
> > >
> > > When the pds_core device notices that the firmware has crashed, it
> > > sends an event to all of its client drivers. When the pds_vfio
> > > driver receives this event while a migration is in progress, it
> > > requests a deferred reset on the next migration state transition.
> > > That state transition will report failure, as will any subsequent
> > > state transition requests from the VMM/VFIO. Based on uapi/vfio.h,
> > > the only way out of VFIO_DEVICE_STATE_ERROR is by issuing
> > > VFIO_DEVICE_RESET. Once this reset is done, the migration state is
> > > reset to VFIO_DEVICE_STATE_RUNNING and migration can be performed
> > > again.
> >
> > Have you actually tested this? Does the qemu side respond properly
> > if this happens during a migration?
> >
> > Jason
>
> Yes, this has actually been tested. It's not necessarily clean as far
> as the log messages go, because the driver may still be getting
> requests (i.e. dirty log requests), but the noise should be okay
> because this is a very rare event.
>
> QEMU does respond properly and in the manner I mentioned above.

But what actually happens? QEMU aborts the migration and FLRs the
device, and then the VM has a totally trashed PCI function? Can the VM
recover from this?

Jason