On 9/21/2022 2:44 PM, Jason Gunthorpe wrote: > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote: > >>> I still think the compat gaps are small. I've realized that >>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it >>> can deadlock the kernel I propose we purge it completely. >> >> Steve won't be happy to hear that, QEMU support exists but isn't yet >> merged. "unhappy" barely scratches the surface of my feelings! Live update is a great feature that solves a real problem, and lots of people have spent lots of time providing thorough feedback on the qemu patches I have submitted. We *will* cross the finish line. In the mean time, I maintain a patched qemu for use in my company, and I have heard from others who do the same. > If Steve wants to keep it then someone needs to fix the deadlock in > the vfio implementation before any userspace starts to appear. The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it. I am not aware of a deadlock problem. Please elaborate or point me to an email thread. > I can fix the deadlock in iommufd in a terrible expensive way, but > would rather we design a better interface if nobody is using it yet. I > advocate for passing the memfd to the kernel and use that as the page > provider, not a mm_struct. memfd support alone is not sufficient. Live update also supports guest ram backed by named shared memory. - Steve >> The issue is where we account these pinned pages, where accounting is >> necessary such that a user cannot lock an arbitrary number of pages >> into RAM to generate a DoS attack. > > It is worth pointing out that preventing a DOS attack doesn't actually > work because a *task* limit is trivially bypassed by just spawning > more tasks. So, as a security feature, this is already very > questionable. > > What we've done here is make the security feature work to actually > prevent DOS attacks, which then gives you this problem: > >> This obviously has implications. AFAICT, any management tool that >> doesn't instantiate assigned device VMs under separate users are >> essentially untenable. > > Because now that the security feature works properly it detects the > DOS created by spawning multiple tasks :( > > Somehow I was under the impression there was not user sharing in the > common cases, but I guess I don't know that for sure. > >>> So, I still like 2 because it yields the smallest next step before we >>> can bring all the parallel work onto the list, and it makes testing >>> and converting non-qemu stuff easier even going forward. >> >> If a vfio compatible interface isn't transparently compatible, then I >> have a hard time understanding its value. Please correct my above >> description and implications, but I suspect these are not just >> theoretical ABI compat issues. Thanks, > > Because it is just fine for everything that doesn't use the ulimit > feature, which is still a lot of use cases! > > Remember, at this point we are not replacing /dev/vfio/vfio, this is > just providing the general compat in a form that has to be opted > into. I think if you open the /dev/iommu device node then you should > get secured accounting. > > If /dev/vfio/vfio is provided by iommufd it may well have to trigger a > different ulimit tracking - if that is the only sticking point it > seems minor and should be addressed in some later series that adds > /dev/vfio/vfio support to iommufd.. > > Jason