On 5/2/22 19:52, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>> On Fri, 29 Apr 2022 05:45:20 +0000
>> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>>>> From: Joao Martins <joao.m.martins@xxxxxxxxxx>
>>>> 3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific to the non-nested vIOMMU case where an
>>>> erroneous guest (or device) is DMAing to an address being unmapped
>>>> at the same time.
>>>
>>> An erroneous attempt like the above cannot anticipate which DMAs can
>>> succeed in that window, thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From the guest's p.o.v. all of this is
>>> black-box hardware logic serving a virtual IOTLB invalidation
>>> request which just cannot be completed in one cycle.
>>>
>>> Hence in reality this is probably not required except to meet the
>>> vfio compat requirement. Just in concept, returning dirty bits at
>>> unmap is more accurate.
>>>
>>> I'm slightly inclined to abandon it in the iommufd uAPI.
>>
>> Sorry, I'm not following why an unmap with returned dirty bitmap
>> operation is specific to a vIOMMU case, or in fact indicative of some
>> sort of erroneous, racy behavior of guest or device.
>
> It is being compared against the alternative, which is to explicitly
> query dirty and then do a normal unmap as two system calls, permitting
> a race.
>
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest
> whether the DMA will complete or not.
>
> e.g. in the vIOMMU case, if the guest races DMA with unmap then we
> are already fine with throwing away that DMA, because that is how the
> race resolves during non-migration situations, so resolving it by
> throwing away the DMA during migration is OK too.
>

Exactly.
Even the current unmap (ignoring dirties) isn't race-free, and DMA could
still be happening between clearing the PTE and the IOTLB flush. The
code in this series *attempted* to tackle races against hw IOMMU updates
to the A/D bits while we are clearing the IOPTEs, but it didn't fully
address the race with DMA. The current code (IIUC) just assumes a page
is dirty if it was pinned and DMA mapped, so maybe it avoided some of
these fundamental questions...

So really the comparison is whether we care about fixing the race
*during unmap* -- a range the device shouldn't be DMA-ing to in the
first place -- such that we need to go out of our way to block DMA
writes from happening, then fetch dirties, and then unmap. Or can we
fetch dirties and then unmap as two separate operations?

>> We need the flexibility to support memory hot-unplug operations
>> during migration,
>
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
>

Hmmm, I don't follow either -- why would we care about hot-unplugged
memory being dirty? Unless Alex is thinking that the guest would take
the initiative in hot-unplugging+hot-plugging and expect the same data
to be there, like pmem style...?

>> This was implemented as a single operation specifically to avoid
>> races where ongoing access may be available after retrieving a
>> snapshot of the bitmap. Thanks,
>
> The issue is the cost.
>
> On a real IOMMU, eliminating the race is expensive as we have to
> write-protect the pages before querying dirty, which seems to require
> an extra IOTLB flush.
>

... and that is only the DMA performance part affecting the endpoint
device. In software, there's also the extra overhead of walking the
IOMMU pagetables twice. So it's like making unmap 2x more expensive.

> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
>
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra overhead and userspace can decide if it wants to pay
> it or not vs the non-atomic read dirties operation. And let's have a
> use case where this must be atomic before we implement it..
>

Definitely. I am happy to implement it if there's a use-case, but I am
not sure there's one right now aside from theory only? Have we seen
issues that would otherwise require this?

> The downside is we lose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
>
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)

Yes, that's a great idea. I am thinking of adding a regular @flags field
to GET_DIRTY_IOVA and to its iommu domain op argument counterpart.

Albeit, from the iommu kAPI side, at the end of the day this primitive
is an IO pagetable walker helper which lets callers check/manipulate
some of the IOPTE special bits and marshal their state into a bitmap.
Extra ::flags values could select other access bits, avoid clearing said
bits, or more, should we want to make it future-proof for extensions.