On Thu, 14 Nov 2019 01:47:04 +0530 Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > On 11/14/2019 1:18 AM, Alex Williamson wrote: > > On Thu, 14 Nov 2019 00:59:52 +0530 > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > >> On 11/13/2019 11:57 PM, Alex Williamson wrote: > >>> On Wed, 13 Nov 2019 11:24:17 +0100 > >>> Cornelia Huck <cohuck@xxxxxxxxxx> wrote: > >>> > >>>> On Tue, 12 Nov 2019 15:30:05 -0700 > >>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > >>>> > >>>>> On Tue, 12 Nov 2019 22:33:36 +0530 > >>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > >>>>> > >>>>>> - Defined MIGRATION region type and sub-type. > >>>>>> - Used 3 bits to define VFIO device states. > >>>>>> Bit 0 => _RUNNING > >>>>>> Bit 1 => _SAVING > >>>>>> Bit 2 => _RESUMING > >>>>>> Combination of these bits defines VFIO device's state during migration > >>>>>> _RUNNING => Normal VFIO device running state. When its reset, it > >>>>>> indicates _STOPPED state. when device is changed to > >>>>>> _STOPPED, driver should stop device before write() > >>>>>> returns. > >>>>>> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but > >>>>>> start saving state of device i.e. pre-copy state > >>>>>> _SAVING => vCPUs are stopped, VFIO device should be stopped, and > >>>>> > >>>>> s/should/must/ > >>>>> > >>>>>> save device state,i.e. stop-n-copy state > >>>>>> _RESUMING => VFIO device resuming state. > >>>>>> _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states > >>>>> > >>>>> A table might be useful here and in the uapi header to indicate valid > >>>>> states: > >>>> > >>>> I like that. > >>>> > >>>>> > >>>>> | _RESUMING | _SAVING | _RUNNING | Description > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 0 | 0 | 0 | Stopped, not saving or resuming (a) > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 0 | 0 | 1 | Running, default state > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 0 | 1 | 0 | Stopped, migration interface in save mode > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 0 | 1 | 1 | Running, save mode interface, iterative > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 1 | 0 | 0 | Stopped, migration resume interface active > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 1 | 0 | 1 | Invalid (b) > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 1 | 1 | 0 | Invalid (c) > >>>>> +-----------+---------+----------+------------------------------------------ > >>>>> | 1 | 1 | 1 | Invalid (d) > >>>>> > >>>>> I think we need to consider whether we define (a) as generally > >>>>> available, for instance we might want to use it for diagnostics or a > >>>>> fatal error condition outside of migration. > >>>>> > >>>>> Are there hidden assumptions between state transitions here or are > >>>>> there specific next possible state diagrams that we need to include as > >>>>> well? > >>>> > >>>> Some kind of state-change diagram might be useful in addition to the > >>>> textual description anyway. Let me try, just to make sure I understand > >>>> this correctly: > >>>> > >> > >> During User application initialization, there is one more state change: > >> > >> 0) 0/0/0 ---- stop to running -----> 0/0/1 > > > > 0/0/0 cannot be the initial state of the device, that would imply that > > a device supporting this migration interface breaks backwards > > compatibility with all existing vfio userspace code and that code needs > > to learn to set the device running as part of its initialization. > > That's absolutely unacceptable. The initial device state must be 0/0/1. > > > > There isn't any device state for all existing vfio userspace code right > now. So default its assumed to be always running. Exactly, there is no representation of device state, therefore it's assumed to be running, therefore when adding a representation of device state it must default to running. > With migration support, device states are explicitly getting added. For > example, in case of QEMU, while device is getting initialized, i.e. from > vfio_realize(), device_state is set to 0/0/0, but not required to convey > it to vendor driver. But we have a 0/0/0 state, why would we intentionally keep an internal state that's inconsistent with the device? > Then with vfio_vmstate_change() notifier, device > state is changed to 0/0/1 when VM/vCPU are transitioned to running, at > this moment device state is conveyed to vendor driver. So vendor driver > doesn't see 0/0/0 state. But the running state is the state of the device, not the VM or the vCPU. Sure we might want to stop the device if the VM/vCPU state is stopped, but we must accept that the device is running when it's opened and we shouldn't intentionally maintain inconsistent state. > While resuming, for userspace, for example QEMU, device state change is > from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic > initialization is done. I don't see why this matters, all device_state transitions are written directly to the vendor driver. The device is initially in 0/0/1 and can be set to 1/0/0 for resuming with an optional transition through 0/0/0 and the vendor driver can see each state change. > >>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1 > >> > >> not just gathering state info, but also copy device state to be > >> transferred during pre-copy phase. > >> > >> Below 2 state are not just to tell driver to stop, those 2 differ. > >> 2) is device state changed from running to stop, this is when VM > >> shutdowns cleanly, no need to save device state > > > > Userspace is under no obligation to perform this state change though, > > backwards compatibility dictates this. > > > >>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 > >> > >>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0 > >> > >> above is transition from pre-copy phase to stop-and-copy phase, where > >> device data should be made available to user to transfer to destination > >> or to save it to file in case of save VM or suspend. > >> > >> > >>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0 > >>> > >>> I think this is to switch into resuming mode, the data will follow > > >>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1 > >>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1 > >>> > >> > >> above can occur on migration cancelled or failed. > >> > >> > >>> I think also: > >>> > >>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy > >> > >> that's right, this happens in case of save VM or suspend VM. > >> > >>> > >>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see > >>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the > >>> received state is incomplete. Somehow though a user always needs to > >>> return the device to the initial state, so how does device_state > >>> interact with the reset ioctl? Would this automatically manipulate > >>> device_state back to 0/0/1? > >> > >> why would reset occur on 1/0/0 -> 0/0/1 failure? > > > > The question is whether the reset ioctl automatically puts the device > > back into the initial state, 0/0/1. A reset from 1/0/0 -> 0/0/1 > > presumably discards much of the device state we just restored, so > > clearly that would be undesirable. > > > >> 1/0/0 -> 0/0/1 fails, then user should convey that to source that > >> migration has failed, then resume at source. > > > > In the scheme of the migration yet, but as far as the vfio interface is > > concerned the user should have a path to make use of a device after > > this point without closing it and starting over. Thus, if a 1/0/0 -> > > 0/0/1 transition fails, would we define the device reset ioctl as a > > mechanism to flush the bogus state and place the device into the 0/0/1 > > initial state? > > > > Ok, userspace applications can be designed to do that. As of now with > QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure. It's simply an ioctl, we must already have access to the device file descriptor to perform the device_state transition. QEMU is not necessarily the consumer of this behavior though, if transition 1/0/0 -> 0/0/1 fails in QEMU, it very well may just exit. The vfio API should support a defined mechanism to recover the device from this state though, which I propose is the existing reset ioctl, which logically implies that any device reset returns the device_state to 0/0/1. > >>>> Not sure about the usefulness of 2). > >> > >> I explained this above. > >> > >>>> Also, is 4) the only way to > >>>> trigger resuming? > >> Yes. > >> > >>>> And is the change in 5) performed by the driver, or > >>>> by userspace? > >>>> > >> By userspace. > >> > >>>> Are any other state transitions valid? > >>>> > >>>> (...) > >>>> > >>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase > >>>>>> + * and for _SAVING device state or stop-and-copy phase: > >>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps. > >>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer. > >>>>>> + * Kernel driver should return this read operation only after writing data to > >>>>>> + * staging buffer is done. > >>>>> > >>>>> "staging buffer" implies a vendor driver implementation, perhaps we > >>>>> could just state that data is available from (region + data_offset) to > >>>>> (region + data_offset + data_size) upon return of this read operation. > >>>>> > >>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in > >>>>>> + * migration region. > >>>>>> + * d. read data_size bytes of data from data_offset in the migration region. > >>>>>> + * e. process data. > >>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data > >>>>>> + * operation from migration region for previous iteration is done. > >>>>> > >>>>> I think this indicate that step (f) should be to read pending_bytes, the > >>>>> read sequence is not complete until this step. Optionally the user can > >>>>> then proceed to step (b). There are no read side-effects of (a) afaict. > >>>>> > >>>>> Is the use required to reach pending_bytes == 0 before changing > >>>>> device_state, particularly transitioning to !_RUNNING? Presumably the > >>>>> user can exit this sequence at any time by clearing _SAVING. > >>>> > >>>> That would be transition 6) above (abort saving and continue). I think > >>>> it makes sense not to forbid this. > >>>> > >>>>> > >>>>>> + * > >>>>>> + * Sequence to be followed while _RESUMING device state: > >>>>>> + * While data for this device is available, repeat below steps: > >>>>>> + * a. read data_offset from where user application should write data. > >>>>>> + * b. write data of data_size to migration region from data_offset. > >>>>>> + * c. write data_size which indicates vendor driver that data is written in > >>>>>> + * staging buffer. Vendor driver should read this data from migration > >>>>>> + * region and resume device's state. > >>>>> > >>>>> The device defaults to _RUNNING state, so a prerequisite is to set > >>>>> _RESUMING and clear _RUNNING, right? > >>>> > >> > >> Sorry, I replied yes in my previous reply, but no. Default device state > >> is _STOPPED. During resume _STOPPED -> _RESUMING > > > > Nope, it can't be, it must be _RUNNING. > > > >>>> Transition 4) above. Do we need > >> > >> I think, its not required. > > > > But above we say it's the only way to trigger resuming (4 was 0/0/1 -> > > 1/0/0). > > > >>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0 > >>>> as well? (Probably depends on how sensible the 0/0/0 state is.) > >>> > >>> I think we must unless we require the user to transition from 0/0/1 to > >>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally > >>> available. Thanks, > >>> > >> > >> its 0/0/0 -> 1/0/0 while resuming. > > > > I think we're starting with different initial states, IMO there is > > absolutely no way around 0/0/1 being the initial device state. > > Anything otherwise means that we cannot add migration support to an > > existing device and maintain compatibility with existing userspace. > > Thanks, > > > Hope above explanation helps to resolve this concern. Not really, I stand by that the default state must reflect previous assumptions and therefore it must be 0/0/1 and additionally we should not maintain state in QEMU intentionally inconsistent with the device state. Thanks, Alex