On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
>
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
>

I had a few things going on my end over the past weeks, but it is getting a bit better now and I will be coming back to this topic. I hope to give you more concrete update/feedback over the coming week or two wrt dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect the overall iommufd design. The main thing is really the lookups to get at the vendor IOPTEs, behind what might be an iommu_sync_dirty_bitmap(domain, iova, size) API.

For toggling the tracking, I'll be simplifying the interface from the other type1 series I had, making it a simple iommu_set_feature(domain, flag, value) behind an iommufd ioctl that can enable/disable it over a domain. Perhaps the same trick could be extended to other features, to give a bit more control over what userspace is allowed to use.

I think this just needs to set/clear a feature bit in the domain, so that VFIO or userspace has full control of dirty tracking during the different stages of migration. In all of the IOMMU implementations/manuals I have read, it means setting a protection domain descriptor flag: on AMD it is a 2-bit field in the DTE; on Intel likewise, but in the PASID table and only for scalable-mode PTEs; on SMMUv3.2 there is an equivalent (albeit past work also had it always-on).
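To make that concrete, here is a rough sketch of the two entry points I have in mind. Only the two function names come from the above; the feature enum and the exact signatures are placeholders, nothing settled:

/*
 * Sketch only, not real kernel API: the two function names follow the
 * proposal above, everything else is invented for illustration.
 */
#include <linux/types.h>

struct iommu_domain;

enum iommu_domain_feature {
        IOMMU_DOMAIN_FEAT_DIRTY,        /* hypothetical feature bit */
};

/*
 * Set/clear a feature bit in the domain, i.e. flip the protection
 * domain descriptor flag (the 2-bit DTE field on AMD, the
 * scalable-mode PASID table entry on Intel, the SMMUv3.2 equivalent).
 * Called on a per-domain basis from an iommufd ioctl.
 */
int iommu_set_feature(struct iommu_domain *domain,
                      enum iommu_domain_feature flag, bool value);

/*
 * Walk the vendor IOPTEs covering [iova, iova + size) and collect
 * their dirty bits; where the resulting bitmap lives (passed in by
 * the caller vs. kept per-domain) is one of the open questions.
 */
int iommu_sync_dirty_bitmap(struct iommu_domain *domain,
                            unsigned long iova, size_t size);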
Provided iommufd /separately/ offers finer granularity over what we can do with the page tables, the VMM can then demote/promote the IOPTEs to a smaller/larger page size at will, as separate operations before and after migration respectively. That logic would probably be better placed in separate iommufd ioctls(), given that it is going to be expensive.

>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
>
> What I suggested was to just split them.
>
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
>
> Some ioctl toward the vfio device will turn on the device's tracker.
>

For activating and fetching data from either tracker, I see a lot in common in terms of UAPI, the difference being whether a device or a list of devices is passed as an argument to the existing dirty-tracking VFIO ioctls(). (At least that's how I am reading your suggestion.)

Though perhaps the main difference is going to be that the device tracker needs to be set up through the hardware interface, and how we carry the regions of memory to be tracked, i.e. the GPA/IOVA ranges that the device should track. The tracked GPA space is sadly not a linear GPA space. But then again, perhaps the internal VFIO API between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped. For the IOMMU this is sort of cheap and 'stateless', but setting up the device tracker might mean giving all the IOVA ranges to the PF (once?). Perhaps it is left to the vendor driver to pick when to register the IOVA space to be tracked, or perhaps that happens when you turn on the device's tracker. But in all cases the driver needs some form of awareness of those ranges, and has to convey them to the PF for tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
>

Yeap. The high cost is scanning the vendor-IOMMU IOPTEs and marshaling them into a bitmap, followed by the smaller cost of copying back to userspace (which KVM does too when it is using a bitmap, same as VFIO today). Maybe it could be optimized to either avoid the copy (GUP, as you mentioned earlier in the thread), or to copy based on the number of leading zeroes of the input bitmap (from the PF), within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
>

If the interface is extended to pass a device or an array of devices (if I understood you correctly), it would free VFIO from having to concatenate potentially different devices' bitmaps into one; userspace would iterate the trackers and merge the bitmaps itself (a rough sketch of that loop is at the end of this mail). Albeit it would require optimizing the marshalling a bit more to avoid too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably rule out the more modern option of ring-based dirty tracking, as the KVM dirty bitmap is mutually exclusive with the KVM dirty ring. Or at least, it would require KVM to always use a bitmap during migration/dirty-rate-estimation in the presence of VFIO/vDPA devices. It's a nice idea, though. It would require making core-IOMMU aware of the bitmap as external storage for tracking (core-IOMMU, not iommufd, since the latter is a module). Although an IOMMU sharing page tables with the CPU would probably align better with reusing KVM's existing dirty bitmap interface. But I haven't given much thought to that one. I just remember reading something in the IOMMU manual about this (ISTR that iommu_v2 was made for that purpose).
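And for completeness, the userspace loop referred to above. Purely illustrative: every name here (the ioctl number, the struct, the fds) is a placeholder I made up, not proposed UAPI -- it only shows qemu querying each enabled tracker and OR-ing the results into one bitmap:

/* Illustrative only -- placeholder UAPI, 4K pages assumed. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

struct dirty_query {
        uint64_t iova;          /* start of the range to query */
        uint64_t size;          /* length of the range in bytes */
        uint64_t *bitmap;       /* one bit per 4K page, filled by the kernel */
};

#define PLACEHOLDER_GET_DIRTY   _IOW(';', 0, struct dirty_query)

/*
 * Query each enabled tracker (system IOMMU via the iommufd, plus each
 * device tracker via its vfio device fd) and OR the per-tracker
 * bitmaps into one, so the caller ends up with the union of all
 * dirtied pages in [iova, iova + size).
 */
static int collect_dirty(const int *tracker_fds, int ntrackers,
                         uint64_t iova, uint64_t size, uint64_t *merged)
{
        uint64_t pages = size / 4096;
        uint64_t words = (pages + 63) / 64;
        uint64_t tmp[words];    /* VLA for brevity; real code would allocate */
        struct dirty_query q = { .iova = iova, .size = size, .bitmap = tmp };
        uint64_t w;
        int i;

        memset(merged, 0, words * sizeof(*merged));
        for (i = 0; i < ntrackers; i++) {
                if (ioctl(tracker_fds[i], PLACEHOLDER_GET_DIRTY, &q) < 0)
                        return -1;
                for (w = 0; w < words; w++)
                        merged[w] |= tmp[w];
        }
        return 0;
}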