Re: iommufd(+vfio-compat) dirty tracking

Joao Martins <joao.m.martins@xxxxxxxxxx> · Wed, 16 Mar 2022 16:36:46 +0000

On 3/15/22 19:29, Jason Gunthorpe wrote:
> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>> On 2/28/22 13:01, Joao Martins wrote:
>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> I'll be simplifying the interface in the other type1 series I had and making it
>>>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>>>> features to have a bit more control on what userspace is allowed to use. I think
>>>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>>>> to have full control during the different stages of migration of dirty tracking.
>>>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>>>> (albeit past work had also it always-on).
>>>>>
>>>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>>>> at will as separate operations, before and after migration respectivally. That logic
>>>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>>>> expensive.
>>>>
>>>> This all sounds right to me
>>>>
>>>> Questions I have:
>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>    ranges? how/what/why?
>>>>
>>> Ignore that. I got mislead by the implementation and when I read the SDM
>>> I realized that the implementation was doing the same thing I was doing
>>> i.e. enabling dirty-bit in the protection domain at start rather than
>>> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
>>> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
>>> stream table entry for the stage2 equivalent. Nothing here is per-range
>>> basis. And the ranges was only used by the implementation for the eager
>>> splitting/merging of IO page table levels.
>>>
>>>>  - What about the unmap and read dirty without races operation that
>>>>    vfio has?
>>>>
>>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>>> after clearing the page table entry. It's marshalling the bits from
>>> iopte into a bitmap as opposed to some logic added on top. As far as I
>>> looked for AMD this isn't difficult to add, (same for Intel) it can
>>> *I think* reuse all of the unmap code.
>>>
>>
>> OK, made some progress.
> 
> Nice!
> 
>  
>> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
>> revising locking, bugs, and more -- works with my emulated qemu patches which
>> is a good sign. But hopefully starts some sort of skeleton of what we were
>> talking about in the above thread.
> 
> I'm a bit bogged with the coming merge window right now, but will have
> more to say in a bit
> 

Thanks! Take your time :)

>> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
>> the vfio-compat one as it was easier provided there's existing userspace to work
>> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
>> steering the dirty tracking, read-clear the dirty bits, unmap and
>> get dirty. 
> 
> I think this is fine to start experimenting with
> 
>> But as we discussed earlier, the one gap of VFIO dirty API is that
>> it lacks controls for upgrading/downgrading area/IOPTE sizes which
>> is where IOMMUFD would most likely shine. When that latter part is
>> done we can probably adopt an eager-split approach inside
>> vfio-compat.
> 
> I think the native API should be new ioctls that operate on the
> hw_pagetable object to:
>   - enable/disable dirty tracking 
>   - read&clear a bitmap from a range
>   - read&unmap a bitmap from a range
>   - Manipulate IOPTE sizes
> 
Yeap -- heh nice, it is roughly what I was into already. I will take the
opportunity of doing the zerocopy/gup of the bitmap when writing the
iommufd native ioctls. The vfio-compat is still copying.

other things on my mind for iommufd native interface are:

1) The iopte split *could* be while we read the
dirty bits of a pgsize != than the args pgsize. The
read_and_clear_dirty() is already expensive, but I wonder about the idea
to take advantage of the dirty-clearing needed IOTLB flush, to also also
update the IOPTEs on the next memory-request/translation-request. Albeit
iopte collpase still needs to be routed via a specific ioctl() to promote
to a higher page size (should the migration fail or something).

(I referred to this in the past as lazy-split contrast to explicit IOPTE
size manipulation iommufd ioctls i.e. {manual,eager}-split)

2) I was thinking io_pagetable new APIs instead of hw_pagetable, but if you're
sort of seeing this as a hw_pagetable object primitive, then I am
wondering why? Albeit at the end of the day we operate on the iopt inside the hw_pagetable.

Note: IIUC that hw_pagetable is supposed to model a iommu_domain
direct manipulation which dirty fits in, but wouldn't that be applicable
to _MAP and _UNMAP too maybe?

> As you say it should not be much distance from the VFIO compat stuff
> 
> Most probably I would say to leave dirty tracking out of the type1 api
> and compat for it. Maybe we can make some limited cases work back
> compat, like the whole ioas supports iommu dirty tracking or
> something..
> 
> Need to understand if it is wortwhile - remember to use any of this
> you need a qemu that is updated to the v2 migration interface,
> so there is little practical value in back compat to old qemu if we
> expect qemu will use the native interface anyhow.
> 

Perhaps the value is really just *how much* applications need to get rewritten
to adjust to iommufd versus iteratively adding small (new) parts of it .
migration-v2 didn't turn into a complete rewrite of the vfio initial part iiuc.
Albeit I suspect keeping the compat was perhaps a struggle to keep, which we
might not what we desire if the semantics diverge too much.

e.g. If the semantics are about the same, for example, new VMMs could use the
iommufd new features using IOPTE size change ioctls() via the VFIO_IOAS (if available)
while leaving vfio-compat somewhat usable on the existent dirty ioctls. While slowly
moving to newer iommufd (and less vfio-iommu-type1) until the point you totally deprecate
it. Even without IOPTE size modification in vfio-compat you would still mark as dirty
a lot less than the current logic would (which really represents the whole bitmap).

I guess at the end of the day it is the maintenance cost -- Fortunately
APIs are looking similar and most of vfio-compat is dealing with userspace
args/validation.

Anyhow just some thoughts, open to anything really :)

>> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
>> Some of the commits made notes of some of research I made, so *I think* the APIs
>> introduced capture h/w semantics for all the three IOMMUs supporting dirty
>> tracking.
> 
> I think the primitives are pretty basic, I'd be surprised if there is
> something different :)
> 
Yeah, they are very basic.

My main unknown in the rework was how the hw protection domain context can deal with
dynamically switching dirty on and off. But I think I answered myself that restriction at
least from groking into specs and experimentation.

> Though things to be thinking about are how does this work with nested
> translation and other advanced features
/me nods

There's usually some flags/bits on the 'stage 2' domain, at least ARM (S2CD) and Intel
(second-level table). But for AMD, you usually control via the vIOMMU what features the hw
exposes[*], and I think there's an extra bit to set to disable vIOMMU dirty tracking from
VMM. But I think still the only thing needed is first-stage tracking, as that's what
matters for L0 live migration.

[*] AMD vIOMMU also offloads some of the command processing, not just the pagetables. Not
sure if ARM/Intel has the equivalent.