Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

Joao Martins <joao.m.martins@xxxxxxxxxx> · Mon, 28 Feb 2022 13:01:39 +0000

On 2/25/22 20:44, Jason Gunthorpe wrote:
> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>> in terms of direction...
>>>>>
>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>> will go forward.
>>>>>
>>>> /me nods, will do. Looking at your repository it is looking good.
>>>
>>> I would like to come with some plan for dirty tracking on iommufd and
>>> combine that with a plan for dirty tracking inside the new migration
>>> drivers.
>>>
>> I had a few things going on my end over the past weeks, albeit it is
>> getting a bit better now and I will be coming back to this topic. I hope/want
>> to give you a more concrete update/feedback over the coming week or two wrt
>> to dirty-tracking+iommufd+amd.
>>
>> So far, I am not particularly concerned that this will affect overall iommufd
>> design. The main thing is really lookups to get vendor iopte, upon on what might
>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>> the tracking,
> 
> I'm not very keen on these multiplexer interfaces. I think you should
> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> 'read_dirty_bits'
> 
> NULL op means not supported.
> 
> IMHO we don't need a kapi wrapper if only iommufd is going to call the
> op.
> 

OK, good point.

Even without a kapi wrapper I am still wondering whether the iommu op needs to
be something like a generic iommu feature toggling (e.g. .set_feature()), rather
than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
the only feature we will change out-of-band in the protection-domain.
I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
because the sync() doesn't only read, but also "writes" to root pagetable to update
the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

And TBH, the question on whether we need a clear op isn't immediately obvious: reading
the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
get_dirty_bitmap() and clear_dirty_ditmap().

Hopefully a futuristic IOMMUs would just get a new DTE/CD/etc field for stashing a set of
PFNs (for a bitmap) that the IOMMU would use for setting the dirty bits there. That is,
rather than forcing sw to split/merge pagetables for better granularity and having to
flush IOTLBs for dirty to be written again :(

>> I'll be simplifying the interface in the other type1 series I had and making it
>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>> enable/disable over a domain. Perhaps same trick could be expanded to other
>> features to have a bit more control on what userspace is allowed to use. I think
>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>> to have full control during the different stages of migration of dirty tracking.
>> In all of the IOMMU implementations/manuals I read it means setting a protection
>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>> (albeit past work had also it always-on).
>>
>> Provided the iommufd does /separately/ more finer granularity on what we can
>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>> at will as separate operations, before and after migration respectivally. That logic
>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>> expensive.
> 
> This all sounds right to me
> 
> Questions I have:
>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>    ranges? how/what/why?
> 
Ignore that. I got mislead by the implementation and when I read the SDM
I realized that the implementation was doing the same thing I was doing
i.e. enabling dirty-bit in the protection domain at start rather than
dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
bit in the context descriptor to enable dirty bits or the STE.S2HD in the
stream table entry for the stage2 equivalent. Nothing here is per-range
basis. And the ranges was only used by the implementation for the eager
splitting/merging of IO page table levels.

>  - What about the unmap and read dirty without races operation that
>    vfio has?
> 
I am afraid that might need a new unmap iommu op that reads the dirty bit
after clearing the page table entry. It's marshalling the bits from
iopte into a bitmap as opposed to some logic added on top. As far as I
looked for AMD this isn't difficult to add, (same for Intel) it can
*I think* reuse all of the unmap code.

>>>> I, too, have been wondering what that is going to look like -- and how do we
>>>> convey the setup of dirty tracking versus the steering of it.
>>>
>>> What I suggested was to just split them.
>>>
>>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
>>> would be on a per-domain basis, not on the ioas.
>>>
>>> Some ioctl toward the vfio device will turn on the device's tracker.
>>>
>> In the activation/fetching-data of either trackers I see some things in common in
>> terms of UAPI with the difference that whether a device or a list of devices are passed on
>> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
>> your suggestion)
> 
> I was thinking a VFIO_DEVICE ioctl located on the device FD
> implemented in the end VFIO driver (like mlx5_vfio). No lists..
> 
Interesting. I was assuming that we wanted to keep the same ioctl()
interface for dirty-tracking hence me mentioning the device/device-list on
top of the DIRTY ioctl. But well on a second thought it doesn't make sense
given that we query a container and we want to move away from vfio for iopgtbl
related-work and rather into iommufd direct access instead by the VMM. So
placing more dependency on that ioctl wouldn't align with that. I suppose, it
makes sense that this is on a per-vfio-device basis, hence device-vfio ioctl().

> As you say the driver should just take in the request to set dirty
> tracking and take core of it somehow. There is no value the core VFIO
> code can add here.
> 
>> Albeit perhaps the main difference is going to be that one needs to
>> setup with hardware interface with the device tracker and how we
>> carry the regions of memory that want to be tracked i.e. GPA/IOVA
>> ranges that the device should track. The tracking-GPA space is not
>> linear GPA space sadly. But at the same point perhaps the internal
>> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
>> ranges we mapped.
> 
> Yes, this is a point that needs some answering. One option is to pass
> in the tracking range list from userspace. Another is to query it in
> the driver from the currently mapped areas in IOAS.
> 
I sort of like the second given that we de-duplicate info that is already
tracked by IOAS -- it would be mostly internal and then it would be a
matter of when does this device tracker is set up, and whether we should
differentiate tracker "start"/"stop" vs "setup"/"teardown".

> I know devices have limitations here in terms of how many/how big the
> ranges can be, and devices probably can't track dynamic changes.
> 
/me nods

Should this be some sort of capability perhaps? Given that this may limit
how many concurrent VFs can be migrated and how much ranges it can store,
for example.

(interestingly and speaking of VF capabilities, it's yet another thing to
tackle in the migration stream between src and dst hosts)

>> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
>> device tracker might mean giving all the IOVA ranges to the PF (once?).
>> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
>> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
>> the driver needs some form of awareness of and convey that to the PF for
>> tracking purposes.
> 
> Yes, this is right
>  
>> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
>> following by a smaller cost copying back to userspace (which KVM does too, when it's using
>> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
>> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
>> (from PF) number of leading zeroes within some threshold.
> 
> What I would probably strive for is an API that deliberately OR's in
> the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
> in your dirty data', then do the next page. etc. That is 134M of IOVA
> per chunk which seems OK.
> 
Yeap, sort of what I was aiming for.

>>> This makes qemu more complicated because it has to decide what
>>> trackers to turn on, but that is also the point because we do want
>>> userspace to be able to decide.
>>>
>> If the interface wants extending to pass a device or an array of devices (if I understood
>> you correctly), it would free/simplify VFIO from having to concatenate potentially
>> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
>> more to avoid performing too much copying.
> 
> Yes. Currently VFIO maintains its own bitmap so it also saves that
> memory by keeping the dirty bits stored in the IOPTEs until read out.
>  
/me nods with the iommu, yes.