> -----Original Message----- > From: liulongfang <liulongfang@xxxxxxxxxx> > Sent: Monday, May 13, 2024 9:16 AM > To: Alex Williamson <alex.williamson@xxxxxxxxxx>; Shameerali Kolothum > Thodi <shameerali.kolothum.thodi@xxxxxxxxxx> > Cc: jgg@xxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; > kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > linuxarm@xxxxxxxxxxxxx > Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of > the XQC address > > On 2024/5/9 22:29, Alex Williamson wrote: > > On Thu, 9 May 2024 09:37:51 +0000 > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx> > wrote: > > > >>> -----Original Message----- > >>> From: Alex Williamson <alex.williamson@xxxxxxxxxx> > >>> Sent: Wednesday, May 8, 2024 7:00 PM > >>> To: liulongfang <liulongfang@xxxxxxxxxx> > >>> Cc: jgg@xxxxxxxxxx; Shameerali Kolothum Thodi > >>> <shameerali.kolothum.thodi@xxxxxxxxxx>; Jonathan Cameron > >>> <jonathan.cameron@xxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; linux- > >>> kernel@xxxxxxxxxxxxxxx; linuxarm@xxxxxxxxxxxxx > >>> Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register > location of > >>> the XQC address > >> > >> [...] > >> > >>>> HiSilicon accelerator equipment can perform general services after > >>> completing live migration. > >>>> This kind of business is executed through the user mode driver and only > >>> needs to use SQE and CQE. > >>>> > >>>> At the same time, this device can also perform kernel-mode services in > the > >>> VM through the crypto > >>>> subsystem. This kind of service requires the use of EQE. > >>>> > >>>> Finally, if the device is abnormal, the driver needs to perform a device > >>> reset, and AEQE needs to > >>>> be used in this case. > >>>> > >>>> Therefore, a complete device live migration function needs to ensure > that > >>> device functions are > >>>> normal in all these scenarios. > >>>> Therefore, this data still needs to be migrated. > >>> > >>> Ok, I had jumped to an in-kernel host driver in reference to "kernel > >>> mode" rather than a guest kernel. Migrating with bad data only affects > >>> the current configuration of the device, reloading a guest driver to > >>> update these registers or a reset of the device would allow proper > >>> operation of the device, correct? > >> > >> Yes, after talking to Longfang, the device RAS will trigger a reset and > >> would function after reset. > >> > >>> > >>> But I think this still isn't really a complete solution, we know > >>> there's a bug in the migration data stream, so not only would we fix > >>> the data stream, but I think we should also take measures to prevent > >>> loading a known bad data stream. AIUI migration of this device while > >>> running in kernel mode (ie. a kernel driver within a guest VM) is > >>> broken. Therefore, the least we can do in a new kernel, knowing that > >>> there was previously a bug in the migration data stream, is to fail to > >>> load that migration data because it risks this scenario where the > >>> device is broken after migration. Shouldn't we then also increment a > >>> migration version field in the data stream to block migrations that > >>> risk this breakage, or barring that, change the magic data field to > >>> prevent the migration? Thanks, > >> > >> Ok. We could add a new ACC_DEV_MAGIC_V2 and prevent the migration > >> in vf_qm_check_match(). The only concern here is that, it will completely > >> block old kernel to new kernel migration and since we can recover the > >> device after the reset whether it is too restrictive or not. > > > > What's the impact to the running driver, kernel or userspace, if the > > device is reset? Migration is intended to be effectively transparent > > If the device is reset, the user's task needs to be restarted. > If an exception has been detected, the best way is not to migrate. > > > to the driver. If the driver stalls and needs to reset the device, > > what has the migration driver accomplished versus an offline migration? > > > > If there's a way to detect from the migration data if the device is > > running in kernel mode or user mode then you could potentially accept > > and send v1 magic conditional that the device is in user mode and > > require v2 magic for any migration where the device is in kernel mode. > > This all adds complication though and seems like it has corner cases > > where we might allow migration to an old kernel that might trap the > > device there if the use case changes. > > > > The driver does not support checking whether the device is running in > kernel mode or user mode. > Moreover, the device supports user-mode services and kernel-mode services > to run at the same time. > > > Essentially it comes down to what should the migration experience be > > and while restricting old->new and new->old migration is undesirable, > > it seems old->old migration is effectively already broken anyway. As > > you consider a v2 magic, perhaps consider how the migration data > > structure might be improved overall to better handle new features and > > bugs. Thanks, > > > > We discussed a plan: > Update ACC_DEV_MAGIC to ACC_DEV_MAGIC_VERSION and configure its > last byte > as version information: > > /* QM match information, last byte is version number */ > #define ACC_DEV_MAGIC_VERSION 0XACCDEVFEEDCAFE01 Oops..cant have V there. But the idea is replace magic with last byte as version info which can be used in future for handling bugs/features etc. Thanks, Shameer