On 8/4/2023 11:03 AM, Jason Gunthorpe wrote:
On Fri, Aug 04, 2023 at 10:34:18AM -0700, Brett Creeley wrote:
On 8/4/2023 10:18 AM, Jason Gunthorpe wrote:
On Tue, Jul 25, 2023 at 02:40:24PM -0700, Brett Creeley wrote:
It's possible that the device firmware crashes due to some configuration
and/or other issue and is then able to recover. If a live migration
is in progress while the firmware crashes, the live migration will
fail. However, the VF PCI device should still be functional post
crash recovery and subsequent migrations should go through as
expected.
When the pds_core device notices that the firmware has crashed, it sends an
event to all its client drivers. When the pds_vfio driver receives
this event while migration is in progress it will request a deferred
reset on the next migration state transition. This state transition
will report failure, as will any subsequent state transition
requests from the VMM/VFIO. Based on uapi/vfio.h the only way out of
VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET. Once this
reset is done, the migration state will be reset to
VFIO_DEVICE_STATE_RUNNING and migration can be performed.
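The sequence described above can be pictured with a minimal sketch. This is not the pds_vfio code: the enum, the struct, and every sketch_* function are hypothetical stand-ins, and "migration in progress" is assumed to mean "any state other than running". It only illustrates the deferred reset flag, the failing transitions, and the reset that clears the error state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified states; not the real uapi/vfio.h values. */
enum mig_state { MIG_ERROR, MIG_STOP, MIG_RUNNING, MIG_STOP_COPY, MIG_RESUMING };

struct sketch_dev {
	enum mig_state state;
	bool deferred_reset;	/* set when firmware crashes mid-migration */
};

/* Firmware-crash event from the core driver (hypothetical hook). */
static void sketch_fw_crash_event(struct sketch_dev *dev)
{
	assert(dev != NULL);
	/* Assumption: any state other than running means a migration is in progress. */
	if (dev->state != MIG_RUNNING)
		dev->deferred_reset = true;
}

/* State transition request from the VMM; returns 0 on success, -1 on failure. */
static int sketch_set_state(struct sketch_dev *dev, enum mig_state next)
{
	assert(dev != NULL);
	if (dev->deferred_reset || dev->state == MIG_ERROR) {
		/* The first request after the crash fails, and so does every later one. */
		dev->deferred_reset = false;
		dev->state = MIG_ERROR;
		return -1;
	}
	dev->state = next;
	return 0;
}

/* Device reset: the only way out of the error state in this sketch. */
static void sketch_reset(struct sketch_dev *dev)
{
	assert(dev != NULL);
	dev->deferred_reset = false;
	dev->state = MIG_RUNNING;
}
```

In this sketch a crash during stop-copy makes the next sketch_set_state() call fail and leaves the device in the error state until sketch_reset() is called, after which transitions succeed again.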
Have you actually tested this? Does the qemu side respond properly if
this happens during a migration?
Jason
Yes, this has actually been tested. It's not necessarily clean as far as the
log messages go because the driver may still be getting requests (i.e. dirty
log requests), but the noise should be okay because this is a very rare
event.
QEMU does respond properly and in the manner I mentioned above.
But what actually happens?
QEMU aborts the migration and FLRs the device and then the VM has a
totally trashed PCI function?
Can the VM recover from this?
Jason
As mentioned above, the VM and PCI function do recover from this and
the subsequent migration works as expected.
Thanks,
Brett