On Thu, 30 Sep 2021 00:48:55 +0300 Max Gurtovoy <mgurtovoy@xxxxxxxxxx> wrote: > On 9/29/2021 7:14 PM, Jason Gunthorpe wrote: > > On Wed, Sep 29, 2021 at 06:28:44PM +0300, Max Gurtovoy wrote: > > > >>> So you have a device that's actively modifying its internal state, > >>> performing I/O, including DMA (thereby dirtying VM memory), all while > >>> in the _STOP state? And you don't see this as a problem? > >> I don't see how is it different from vfio-pci situation. > > vfio-pci provides no way to observe the migration state. It isn't > > "000b" > > Alex said that there is a problem of compatibility. > > I migration SW is not involved, nobody will read this migration state. The _STOP state has a specific meaning regardless of whether userspace reads the device state value. I think what you're suggesting is that the device reports itself as _STOP'd but it's actually _RUNNING. Is that the compatibility workaround, create a self inconsistency? We cannot impose on userspace to move a device from _STOP to _RUNNING simply because the device supports the migration region, nor should we report a device state that is inconsistent with the actual device state. > >> Maybe we need to rename STOP state. We can call it READY or LIVE or > >> NON_MIGRATION_STATE. > > It was a poor choice to use 000b as stop, but it doesn't really > > matter. The mlx5 driver should just pre-init this readable to running. > > I guess we can do it for this reason. There is no functional problem nor > compatibility issue here as was mentioned. > > But still we need the kernel to track transitions. We don't want to > allow moving from RESUMING to SAVING state for example. How this > transition can be allowed ? > > In this case we need to fail the request from the migration SW... _RESUMING to _SAVING seems like a good way to test round trip migration without running the device to modify the state. Potentially it's a means to update a saved device migration data stream to a newer format using an intermediate driver version. If a driver is written such that it simply sees clearing the _RESUME bit as an indicator to de-serialize the data stream to the device, and setting the _SAVING flag as an indicator to re-serialize that data stream from the device, then this is just a means to make use of existing data paths. The uAPI specifies a means for drivers to reject a state change, but that risks failing to support a transition which might find mainstream use cases. I don't think common code should be responsible for filtering out viable transitions. Thanks, Alex