On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote: > On Wed, 29 Sep 2021 13:16:02 -0300 > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote: > > > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote: > > > On Tue, 28 Sep 2021 16:35:50 -0300 > > > Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > > > > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote: > > > > > > > > > In defining the device state, we tried to steer away from defining it > > > > > in terms of the QEMU migration API, but rather as a set of controls > > > > > that could be used to support that API to leave us some degree of > > > > > independence that QEMU implementation might evolve. > > > > > > > > That is certainly a different perspective, it would have been > > > > better to not express this idea as a FSM in that case... > > > > > > > > So each state in mlx5vf_pci_set_device_state() should call the correct > > > > combination of (un)freeze, (un)quiesce and so on so each state > > > > reflects a defined operation of the device? > > > > > > I'd expect so, for instance the implementation of entering the _STOP > > > state presumes a previous state that where the device is apparently > > > already quiesced. That doesn't support a direct _RUNNING -> _STOP > > > transition where I argued in the linked threads that those states > > > should be reachable from any other state. Thanks, > > > > If we focus on mlx5 there are two device 'flags' to manage: > > - Device cannot issue DMAs > > - Device internal state cannot change (ie cannot receive DMAs) > > > > This is necessary to co-ordinate across multiple devices that might be > > doing peer to peer DMA between them. The whole multi-device complex > > should be moved to "cannot issue DMA's" then the whole complex would > > go to "state cannot change" and be serialized. > > Are you anticipating p2p from outside the VM? The typical scenario > here would be that p2p occurs only intra-VM, so all the devices would > stop issuing DMA (modulo trying to quiesce devices simultaneously). Inside the VM. Your 'modulo trying to quiesce devices simultaneously' is correct - this is a real issue that needs to be solved. If we put one device in a state where it's internal state is immutable it can no longer accept DMA messages from the other devices. So there are two states in the HW model - do not generate DMAs and finally the immutable internal state where even external DMAs are refused. > > The expected sequence at the device is thus > > > > Resuming > > full stop -> does not issue DMAs -> full operation > > Suspend > > full operation -> does not issue DMAs -> full stop > > > > Further the device has two actions > > - Trigger serializating the device state > > - Trigger de-serializing the device state > > > > So, what is the behavior upon each state: > > > > * 000b => Device Stopped, not saving or resuming > > Does not issue DMAs > > Internal state cannot change > > > > * 001b => Device running, which is the default state > > Neither flags > > > > * 010b => Stop the device & save the device state, stop-and-copy state > > Does not issue DMAs > > Internal state cannot change > > > > * 011b => Device running and save the device state, pre-copy state > > Neither flags > > (future, DMA tracking turned on) > > > > * 100b => Device stopped and the device state is resuming > > Does not issue DMAs > > Internal state cannot change > > cannot change... other that as loaded via migration region. Yes > The expected protocol is that if the user write to the device_state > register returns an errno, the user reevaluates the device_state to > determine if the desired transition is unavailable (previous state > value is returned) or generated a fault (error state value > returned). Hmm, interesting, mlx5 should be doing this as well. Eg resuming with corrupt state should fail and cannot be recovered except via reset. > The 101b state indicates _RUNNING while _RESUMING, which is simply not > a mode that has been spec'd at this time as it would require some > mechanism for the device to fault in state on demand. So lets error on these requests since we don't know what state to put the device into. > > The two actions: > > trigger serializing the device state > > Done when asked to go to 010b ? > > When the _SAVING bit is set. The exact mechanics depends on the size > and volatility of the device state. A GPU might begin in pre-copy > (011b) to transmit chunks of framebuffer data, recording hashes of > blocks read by the user to avoid re-sending them during the > stop-and-copy (010b) phase. Here I am talking specifically about mlx5 which does not have a state capture in pre-copy. So mlx5 should capture state on 010b only, and the 011b is a NOP. > > trigger de-serializing the device state > > Done when transition from 100b -> 000b ? > > 100b -> 000b is not a required transition, generally this would be 100b > -> 001b, ie. end state of _RUNNING vs _STOP. Sorry, I typo'd it, yes to _RUNNING > I think the requirement is that de-serialization is complete when the > _RESUMING bit is cleared. Whether the driver chooses to de-serialize > piece-wise as each block of data is written to the device or in bulk > from a buffer is left to the implementation. In either case, the > driver can fail the transition to !_RESUMING if the state is incomplete > or otherwise corrupt. It would again be the driver's discretion if > the device enters the error state or remains in _RESUMING. If the user > has no valid state with which to exit the _RESUMING phase, a device > reset should return the device to _RUNNING with a default initial state. That makes sense enough. > > There is a missing state "Stop Active Transactions" which would be > > only "does not issue DMAs". I've seen a proposal to add that. > > This would be to get all devices to stop issuing DMA while internal > state can be modified to avoid the synchronization issue of trying to > stop devices concurrently? Yes, as above > For PCI devices we obviously have the bus master bit to manage that, > but I could see how a migration extension for such support (perhaps > even just wired through to BM for PCI) could be useful. Thanks, I'm nervous to override the BM bit for something like this, the BM bit isn't a gentle "please coherently stop what you are doing" it is a hanbrake the OS pulls to ensure any PCI device becomes quiet. Thanks, Jason