On 10/12/2022 8:32 AM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
>> On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
>>> On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
>>>>> Do we have a solution to this?
>>>>>
>>>>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
>>>>>
>>>>> Aside from the approach to use the FD, another idea is to just use
>>>>> fork.
>>>>>
>>>>> qemu would do something like
>>>>>
>>>>>  .. stop all container ioctl activity ..
>>>>>  fork()
>>>>>     ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>     .. signal parent..
>>>>>     .. wait parent..
>>>>>     exit(0)
>>>>>  .. wait child ..
>>>>>  exec()
>>>>>  ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>  ..signal child..
>>>>>  waitpid(childpid)
>>>>>
>>>>> This way the kernel is never left without a page provider for the
>>>>> maps, the dummy mm_struct belonging to the fork will serve that role
>>>>> for the gap.
>>>>>
>>>>> And the above is only required if we have mdevs, so we could imagine
>>>>> userspace optimizing it away for, eg vfio-pci only cases.
>>>>>
>>>>> It is not as efficient as using a FD backing, but this is super easy
>>>>> to implement in the kernel.
>>>>
>>>> I propose to avoid deadlock for mediated devices as follows. Currently, an
>>>> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
>>>> is asserted.
>>>>
>>>>   * In vfio_wait, I will maintain a list of waiters, each list element
>>>>     consisting of (task, mdev, close_flag=false).
>>>>
>>>>   * When the vfio device descriptor is closed, vfio_device_fops_release
>>>>     will notify the vfio_iommu driver, which will find the mdev on the waiters
>>>>     list, set elem->close_flag=true, and call wake_up_process for the task.
>>>
>>> This alone is not sufficient, the mdev driver can continue to
>>> establish new mappings until its close_device function
>>> returns. Killing only existing mappings is racy.
>>>
>>> I think you are focusing on the one issue I pointed at, as I said, I'm
>>> sure there are more ways than just close to abuse this functionality
>>> to deadlock the kernel.
>>>
>>> I continue to prefer we remove it completely and do something more
>>> robust. I suggested two options.
>>
>> It's not racy. New pin requests also land in vfio_wait if any vaddr's have
>> been invalidated in any vfio_dma in the iommu. See
>>   vfio_iommu_type1_pin_pages()
>>     if (iommu->vaddr_invalid_count)
>>       vfio_find_dma_valid()
>>         vfio_wait()
>
> I mean you can't do a one shot wakeup of only existing waiters, and
> you can't corrupt the container to wake up waiters for other devices,
> so I don't see how this can be made to work safely...
>
> It also doesn't solve any flow that doesn't trigger file close, like a
> process thread being stuck on the wait in the kernel. eg because a
> trapped mmio triggered an access or something.
>
> So it doesn't seem like a workable direction to me.
>
>> However, I will investigate saving a reference to the file object in
>> the vfio_dma (for mappings backed by a file) and using that to
>> translate IOVA's.
>
> It is certainly the best flow, but it may be difficult. Eg the memfd
> work for KVM to do something similar is quite involved.
>
>> I think that will be easier to use than fork/CHANGE_MM/exec, and may
>> even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be
>> continued.
>
> Yes, certainly easier to use, I suggested CHANGE_MM because the kernel
> implementation is very easy, I could send you something to test w/
> iommufd in a few hours effort probably.
>
> Anyhow, I think this conversation has convinced me there is no way to
> fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
> it being a security bug, basically.

Please do not. Please give me the courtesy of time to develop a replacement
before we delete it. Surely you can make progress on other open areas of
iommufd without needing to delete this immediately.
- Steve