> From: Brett Creeley <bcreeley@xxxxxxx>
> Sent: Saturday, August 5, 2023 2:51 AM
>
> On 8/4/2023 11:03 AM, Jason Gunthorpe wrote:
> > On Fri, Aug 04, 2023 at 10:34:18AM -0700, Brett Creeley wrote:
> >> On 8/4/2023 10:18 AM, Jason Gunthorpe wrote:
> >>> On Tue, Jul 25, 2023 at 02:40:24PM -0700, Brett Creeley wrote:
> >>>> It's possible that the device firmware crashes and is able to recover
> >>>> due to some configuration and/or other issue. If a live migration
> >>>> is in progress while the firmware crashes, the live migration will
> >>>> fail. However, the VF PCI device should still be functional post
> >>>> crash recovery and subsequent migrations should go through as
> >>>> expected.
> >>>>
> >>>> When the pds_core device notices that the firmware has crashed, it
> >>>> sends an event to all its client drivers. When the pds_vfio driver
> >>>> receives this event while a migration is in progress, it will request
> >>>> a deferred reset on the next migration state transition. This state
> >>>> transition will report failure, as will any subsequent state
> >>>> transition requests from the VMM/VFIO. Based on uapi/vfio.h, the only
> >>>> way out of VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET.
> >>>> Once this reset is done, the migration state will be reset to
> >>>> VFIO_DEVICE_STATE_RUNNING and migration can be performed.
> >>>
> >>> Have you actually tested this? Does the qemu side respond properly if
> >>> this happens during a migration?
> >>>
> >>> Jason
> >>
> >> Yes, this has actually been tested. It's not necessarily clean as far
> >> as the log messages go, because the driver may still be getting
> >> requests (i.e. dirty log requests), but the noise should be okay
> >> because this is a very rare event.
> >>
> >> QEMU does respond properly and in the manner I mentioned above.
> >
> > But what actually happens?
> >
> > QEMU aborts the migration and FLRs the device, and then the VM has a
> > totally trashed PCI function?
> >
> > Can the VM recover from this?
> >
> > Jason
>
> As mentioned above, the VM and PCI function do recover from this, and
> the subsequent migration works as expected.

If a reset is requested by the host, how is the VM notified so it can handle this undesired situation? Could it lead to observable application failures inside the guest after the recovery?