Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

Steven Sistare <steven.sistare@xxxxxxxxxx> · Wed, 21 Sep 2022 15:30:55 -0400

On 9/21/2022 2:44 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> 
>>> I still think the compat gaps are small. I've realized that
>>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
>>> can deadlock the kernel I propose we purge it completely.
>>
>> Steve won't be happy to hear that, QEMU support exists but isn't yet
>> merged.

"unhappy" barely scratches the surface of my feelings!

Live update is a great feature that solves a real problem, and lots of 
people have spent lots of time providing thorough feedback on the qemu
patches I have submitted.  We *will* cross the finish line.  In the
mean time, I maintain a patched qemu for use in my company, and I have
heard from others who do the same.

> If Steve wants to keep it then someone needs to fix the deadlock in
> the vfio implementation before any userspace starts to appear. 

The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
across exec, which can result in mm->locked_vm becoming negative. I have several 
fixes, but none result in limits being reached at exactly the same time as before --
the same general issue being discussed for iommufd.  I am still thinking about it.

I am not aware of a deadlock problem.  Please elaborate or point me to an
email thread.

> I can fix the deadlock in iommufd in a terrible expensive way, but
> would rather we design a better interface if nobody is using it yet. I
> advocate for passing the memfd to the kernel and use that as the page
> provider, not a mm_struct.

memfd support alone is not sufficient.  Live update also supports guest ram
backed by named shared memory.

- Steve

>> The issue is where we account these pinned pages, where accounting is
>> necessary such that a user cannot lock an arbitrary number of pages
>> into RAM to generate a DoS attack.  
> 
> It is worth pointing out that preventing a DOS attack doesn't actually
> work because a *task* limit is trivially bypassed by just spawning
> more tasks. So, as a security feature, this is already very
> questionable.
> 
> What we've done here is make the security feature work to actually
> prevent DOS attacks, which then gives you this problem:
> 
>> This obviously has implications.  AFAICT, any management tool that
>> doesn't instantiate assigned device VMs under separate users are
>> essentially untenable.
> 
> Because now that the security feature works properly it detects the
> DOS created by spawning multiple tasks :(
> 
> Somehow I was under the impression there was not user sharing in the
> common cases, but I guess I don't know that for sure.
> 
>>> So, I still like 2 because it yields the smallest next step before we
>>> can bring all the parallel work onto the list, and it makes testing
>>> and converting non-qemu stuff easier even going forward.
>>
>> If a vfio compatible interface isn't transparently compatible, then I
>> have a hard time understanding its value.  Please correct my above
>> description and implications, but I suspect these are not just
>> theoretical ABI compat issues.  Thanks,
> 
> Because it is just fine for everything that doesn't use the ulimit
> feature, which is still a lot of use cases!
> 
> Remember, at this point we are not replacing /dev/vfio/vfio, this is
> just providing the general compat in a form that has to be opted
> into. I think if you open the /dev/iommu device node then you should
> get secured accounting.
> 
> If /dev/vfio/vfio is provided by iommufd it may well have to trigger a
> different ulimit tracking - if that is the only sticking point it
> seems minor and should be addressed in some later series that adds
> /dev/vfio/vfio support to iommufd..
> 
> Jason