On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote: > On 10/11/2022 8:30 AM, Jason Gunthorpe wrote: > > On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote: > >>> Do we have a solution to this? > >>> > >>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR > >>> > >>> Aside from the approach to use the FD, another idea is to just use > >>> fork. > >>> > >>> qemu would do something like > >>> > >>> .. stop all container ioctl activity .. > >>> fork() > >>> ioctl(CHANGE_MM) // switch all maps to this mm > >>> .. signal parent.. > >>> .. wait parent.. > >>> exit(0) > >>> .. wait child .. > >>> exec() > >>> ioctl(CHANGE_MM) // switch all maps to this mm > >>> ..signal child.. > >>> waitpid(childpid) > >>> > >>> This way the kernel is never left without a page provider for the > >>> maps, the dummy mm_struct belonging to the fork will serve that role > >>> for the gap. > >>> > >>> And the above is only required if we have mdevs, so we could imagine > >>> userspace optimizing it away for, eg vfio-pci only cases. > >>> > >>> It is not as efficient as using a FD backing, but this is super easy > >>> to implement in the kernel. > >> > >> I propose to avoid deadlock for mediated devices as follows. Currently, an > >> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR > >> is asserted. > >> > >> * In vfio_wait, I will maintain a list of waiters, each list element > >> consisting of (task, mdev, close_flag=false). > >> > >> * When the vfio device descriptor is closed, vfio_device_fops_release > >> will notify the vfio_iommu driver, which will find the mdev on the waiters > >> list, set elem->close_flag=true, and call wake_up_process for the task. > > > > This alone is not sufficient, the mdev driver can continue to > > establish new mappings until it's close_device function > > returns. Killing only existing mappings is racy. > > > > I think you are focusing on the one issue I pointed at, as I said, I'm > > sure there are more ways than just close to abuse this functionality > > to deadlock the kernel. > > > > I continue to prefer we remove it completely and do something more > > robust. I suggested two options. > > It's not racy. New pin requests also land in vfio_wait if any vaddr's have > been invalidated in any vfio_dma in the iommu. See > vfio_iommu_type1_pin_pages() > if (iommu->vaddr_invalid_count) > vfio_find_dma_valid() > vfio_wait() I mean you can't do a one shot wakeup of only existing waiters, and you can't corrupt the container to wake up waiters for other devices, so I don't see how this can be made to work safely... It also doesn't solve any flow that doesn't trigger file close, like a process thread being stuck on the wait in the kernel. eg because a trapped mmio triggered an access or something. So it doesn't seem like a workable direction to me. > However, I will investigate saving a reference to the file object in > the vfio_dma (for mappings backed by a file) and using that to > translate IOVA's. It is certainly the best flow, but it may be difficult. Eg the memfd work for KVM to do something similar is quite involved. > I think that will be easier to use than fork/CHANGE_MM/exec, and may > even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be > continued. Yes, certainly easier to use, I suggested CHANGE_MM because the kernel implementation is very easy, I could send you something to test w/ iommufd in a few hours effort probably. Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically. Jason