On Thu, Dec 09, 2021 at 04:34:57PM -0700, Alex Williamson wrote:
> On Tue, 7 Dec 2021 13:13:00 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > Provide some more complete documentation for the migration regions
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> > 
> > To: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > Cc: Cornelia Huck <cohuck@xxxxxxxxxx>
> > Cc: kvm@xxxxxxxxxxxxxxx
> > Cc: Max Gurtovoy <mgurtovoy@xxxxxxxxxx>
> > Cc: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> > Cc: Yishai Hadas <yishaih@xxxxxxxxxx>
> > Cc: Shameer Kolothum <shameerali.kolothum.thodi@xxxxxxxxxx>
> > Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
> >  Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
> >  1 file changed, 300 insertions(+), 1 deletion(-)
> 
> I'm sending a rewrite of the uAPI separately. I hope this brings it
> more in line with what you consider to be a viable specification and
> perhaps makes some of the below new documentation unnecessary.

It is far better than what was there before, and sufficiently terse
that it is OK in a header file. Really, what you've got there is quite
a great job.

Honestly, I don't think I can write something at quite that level, if
that is your expectation of what we need to achieve here.

> > +-------------------------------------------------------------------------------
> > +
> > +VFIO migration driver API
> > +-------------------------------------------------------------------------------
> > +
> > +VFIO drivers that support migration implement a migration control register
> > +called device_state in the struct vfio_device_migration_info which is in its
> > +VFIO_REGION_TYPE_MIGRATION region.
> > +
> > +The device_state controls both device action and continuous behavior.
> > +Setting/clearing bit groups triggers device action, and each bit controls a
> > +continuous device behavior.
> This notion of device actions and continuous behavior seems to make
> such a simple concept incredibly complicated. We have "is the device
> running or not" and a new modifier bit to that, and which mode is the
> migration region, off, saving, or resuming. Seems simple enough, but I
> can't follow your bit groups below.

It is an effort to bridge from the very simple view you wrote to a
fuller understanding of what the driver should be implementing. We
must talk about SAVING|RUNNING / SAVING|!RUNNING together to be able
to explain everything that is going on.

But we probably don't want the introductory paragraphs at all. Let's
just refer to the header file and explain that the following
discussion elaborates on it.

> > +Along with the device_state the migration driver provides a data window which
> > +allows streaming migration data into or out of the device. The entire
> > +migration data, up to the end of stream must be transported from the saving to
> > +resuming side.
> > +
> > +A lot of flexibility is provided to user-space in how it operates these
> > +bits. What follows is a reference flow for saving device state in a live
> > +migration, with all features, and an illustration how other external non-VFIO
> > +entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.
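
To make the VMM side of the flow below a bit more concrete, here is
roughly what I picture for the stop-and-copy step, written against the
existing region layout. This is only a sketch: error handling is
elided, the migration region offset is assumed to have been found
already via VFIO_DEVICE_GET_REGION_INFO, and NDMA is left out since
that bit is still only a proposal:

/*
 * Sketch only: VMM-side stop-and-copy using the v1 migration region.
 * device_fd is an open VFIO device fd, mig_off is the file offset of the
 * VFIO_REGION_TYPE_MIGRATION region.  All error handling is elided.
 */
#include <linux/vfio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MIG_FIELD(f) offsetof(struct vfio_device_migration_info, f)

static void save_final_state(int device_fd, off_t mig_off, FILE *out)
{
	uint32_t state;
	uint64_t pending, data_offset, data_size;
	char buf[4096];

	/* SAVING | !RUNNING: halt the device and conclude the migration data */
	state = VFIO_DEVICE_STATE_SAVING;
	pwrite(device_fd, &state, sizeof(state), mig_off + MIG_FIELD(device_state));

	for (;;) {
		/* Reading pending_bytes starts the next iteration */
		pread(device_fd, &pending, sizeof(pending),
		      mig_off + MIG_FIELD(pending_bytes));
		if (!pending)
			break;		/* end of stream */

		pread(device_fd, &data_offset, sizeof(data_offset),
		      mig_off + MIG_FIELD(data_offset));
		pread(device_fd, &data_size, sizeof(data_size),
		      mig_off + MIG_FIELD(data_size));

		/* Copy this chunk of device state out of the data window */
		while (data_size) {
			size_t len = data_size < sizeof(buf) ? data_size : sizeof(buf);

			pread(device_fd, buf, len, mig_off + data_offset);
			fwrite(buf, 1, len, out);
			data_offset += len;
			data_size -= len;
		}
	}
}

Pre-copy is the same pending_bytes/data_offset/data_size loop, just
entered with SAVING | RUNNING, and the resume side is the mirror image
through the same window with RESUMING set.
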
> > +
> > +  RUNNING, VCPU_RUNNING
> > +     Normal operating state
> > +  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > +     Log DMAs
> > +
> > +     Stream all memory
> > +  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> > +     Log internal device changes (pre-copy)
> > +
> > +     Stream device state through the migration window
> > +
> > +     While in this state repeat as desired:
> > +
> > +        Atomic Read and Clear DMA Dirty log
> > +
> > +        Stream dirty memory
> > +  SAVING | NDMA | RUNNING, VCPU_RUNNING
> > +     vIOMMU grace state
> > +
> > +     Complete all in progress IO page faults, idle the vIOMMU
> > +  SAVING | NDMA | RUNNING
> > +     Peer to Peer DMA grace state
> > +
> > +     Final snapshot of DMA dirty log (atomic not required)
> > +  SAVING
> > +     Stream final device state through the migration window
> > +
> > +     Copy final dirty data
> So yes, let's move use of migration region in support of a VMM here,
> but as I mentioned in the last round, these notes per state are all
> over the map and some of them barely provide enough of a clue to know
> what you're getting at. Let's start simple and build.

I'm not sure what you are suggesting? Combined with the new header
file this is much better; it tersely explains from a VMM point of view
what each state is about.

Do you think this section should be longer and the section below much
shorter? That might be a better document.

> > +  0
> > +     Device is halted
> We don't care what the device state goes to after we're done collecting
> data from it.

The reference flow is just a reference, choosing to go to 0 is fine,
right?

> > +and the reference flow for resuming:
> > +
> > +  RUNNING
> > +     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
> > +  RESUMING
> > +     Push in migration data.
> > +  NDMA | RUNNING
> > +     Peer to Peer DMA grace state
> > +  RUNNING, VCPU_RUNNING
> > +     Normal operating state
> > +
> > +If the VMM has multiple VFIO devices undergoing migration then the grace
> > +states act as cross device synchronization points. The VMM must bring all
> > +devices to the grace state before advancing past it.
> Why? (rhetorical) Describe that we can't stop all device atomically
> therefore we need to running-but-not-initiating state to quiesce the
> system to finish up saving and the same because we can't atomically
> release all devices on the restoring end.

OK

> > +Event triggered actions happen when user-space requests a new device_state
> > +that differs from the current device_state. Actions happen on a bit group
> > +basis:
> > +
> > +  SAVING
> Does this mean the entire new device_state is (SAVING) or does this
> mean that we set the SAVING bit independent of all other bits.

It says "actions happen on a bit group basis", so independent of all
other bits, as you say.

But perhaps we don't need this at all anymore, as the header file is
sufficient.

> > +    The device clears the data window and prepares to stream migration data.
> > +    The entire data from the start of SAVING to the end of stream is transfered
> > +    to the other side to execute a resume.
> "Clearing the data window" is an implementation, each iteration of the
> migration protocol provides "something" in the data window. The
> migration driver could take no action when SAVING is set and simply
> evaluate what the current device state is when pending_bytes is read.

It is the same as what you said: "initializes the migration region
data window".

> > +  SAVING | RUNNING
> If we're trying to model typical usage scenarios, it's confusing that
> we started with SAVING and jumped back to (SAVING | RUNNING).

This section isn't about usage scenarios; it is talking about what the
driver must do in all the state combinations.

SAVING is "initializing the data window", and then the two variations
of RUNNING have their own special behaviors.

> > +    This allows the device to implement a dirty log for its internal state.
> > +    During this state the data window should present the device state being
> > +    logged and during SAVING | !RUNNING the data window should transfer the
> > +    dirtied state and conclude the migration data.
> As we discussed in the previous revision, invariant data could also
> reasonably be included here. We're again sort of pushing an
> implementation agenda, but the more useful thing to include here would
> be to say something about how drivers and devices should attempt to
> support any bulk data in this pre-copy phase in order to allow
> userspace to perform a migration with minimal actual time in the next
> state.

Invariant data is implicitly already "device state being logged" - the
log is always 'no change'.

> > +    The state is only concerned with internal device state. External DMAs are
> > +    covered by the separate DIRTY_TRACKING function.
> > +
> > +  SAVING | !RUNNING
> And this means we set SAVING and cleared RUNNING, and only those bits
> or independent of other bits? Give your reader a chance to follow
> along even if you do expect them to read it a few times for it all to
> sink in.

None of this is about set or cleared, where did you get that?

The top paragraph said: "requests a new device_state" - that means
only the new device_state value matters; the change to get there is
irrelevant.

> > +    If the migration data is invalid then the ERROR state must be set.
> I don't know why we're specifying this, it's at the driver discretion
> to use the ERROR state, but we tend to suggest it for irrecoverable
> errors. Maybe any such error here could be considered irrecoverable,
> or maybe the last data segment was missing and once it's added we can
> continue.

This was an explicit statement that seems to contradict what you wrote
in the header. I prefer we are deterministic: if the RESUME fails then
go to ERROR, always. Devices do not have the choice to do something
else.

> > +  ERROR
> > +    The behavior of the device is largely undefined. The device must be
> > +    recovered by issuing VFIO_DEVICE_RESET or closing the device file
> > +    descriptor.
> > +
> > +    However, devices supporting NDMA must behave as though NDMA is asserted
> > +    during ERROR to avoid corrupting other devices or a VM during a failed
> > +    migration.
> As clarified in the uAPI, we chose the invalid state that we did as the
> error state specifically because of the !RUNNING value. Migration
> drivers should honor that, therefore NDMA in ERROR state is irrelevant.

This is another explicit statement that you have contradicted in the
header.

I'm not sure mlx5 can implement this. Certainly, it becomes very hard
if we continue to support precedence. Unwinding an error during a
multi-bit sequence and guaranteeing that we can somehow make it back
to !RUNNING is very complex. Several error scenarios mean the driver
has lost control of the device.

I'm not even sure we can do the !NDMA I wrote; in hindsight I don't
think we checked that enough.

Yishai noticed all the error unwinding was broken in mlx5 for
precedence cases after I wrote this.

> > +  NDMA is made optional to support simple HW implementations that either just
> > +  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
> > +  necessary for special features like P2P and PRI, so devices can omit it in
> > +  exchange for limitations on the guest.
> Maybe we can emphasize this a little more as it's potentially pretty
> significant. Developers should not just think of their own device in
> isolation, but their device in the context of devices that may have
> performance, if not functional, restrictions with those limitations.

Ok

> > +
> > +- Devices that have their HW migration control MMIO registers inside the same
> > +  iommu_group as the VFIO device have some special considerations. In this
> > +  case a driver will be operating HW registers from kernel space that are also
> > +  subjected to userspace controlled DMA due to the iommu_group.
> > +
> > +  This immediately raises a security concern as user-space can use Peer to
> > +  Peer DMA to manipulate these migration control registers concurrently with
> > +  any kernel actions.
> > +
> > +  A device driver operating such a device must ensure that kernel integrity
> > +  cannot be broken by hostile user space operating the migration MMIO
> > +  registers via peer to peer, at any point in the sequence. Further the kernel
> > +  cannot use DMA to transfer any migration data.
> > +
> > +  However, as discussed above in the "Device Peer to Peer DMA" section, it can
> > +  assume quiet MMIO as a condition to have a successful and uncorrupted
> > +  migration.
> > +
> > +To elaborate details on the reference flows, they assume the following details
> > +about the external behaviors:
> > +
> > +  !VCPU_RUNNING
> > +    User-space must not generate dirty pages or issue MMIO, PIO or equivalent
> > +    operations to devices. For a VMM this would typically be controlled by
> > +    KVM.
> > +
> > +  DIRTY_TRACKING
> > +    Clear the DMA log and start DMA logging
> > +
> > +    DMA logs should be readable with an "atomic test and clear" to allow
> > +    continuous non-disruptive sampling of the log.
> > +
> > +    This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
> > +    fd.
> > +
> > +  !DIRTY_TRACKING
> > +    Freeze the DMA log, stop tracking and allow user-space to read it.
> > +
> > +    If user-space is going to have any use of the dirty log it must ensure that
> > +    all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
> > +    NDMA or !RUNNING on all VFIO devices.
> Minimally there should be reference markers to direct to these
> definitions before they were thrown at the reader in the beginning, but
> better yet would be to adjust the flow to make them unnecessary.

The first draft was ordered like this; Connie felt that was confusing,
so it was moved to the end :)

> > +TDB - discoverable feature flag for NDMA
> Updated in the uAPI spec.

Thanks, it matches what Yishai did.

Jason