On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
>>
>>>> If Steve wants to keep it then someone needs to fix the deadlock in
>>>> the vfio implementation before any userspace starts to appear.
>>>
>>> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
>>> across exec, which can result in mm->locked_vm becoming negative. I have several
>>> fixes, but none result in limits being reached at exactly the same time as before --
>>> the same general issue being discussed for iommufd. I am still thinking about it.
>>
>> Oh, yeah, I noticed this was all busted up too.
>>
>>> I am not aware of a deadlock problem. Please elaborate or point me to an
>>> email thread.
>>
>> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
>> userspace can trigger the lock to be taken and then returns to
>> userspace with the lock held.
>>
>> Any scenario where a kernel thread hits that open-coded lock and then
>> userspace does-the-wrong-thing will deadlock the kernel.
>>
>> For instance, consider an mdev driver. We assert
>> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
>> and becomes blocked on the now locked lock. Userspace then tries to
>> close the device FD.
>>
>> FD closure will trigger device close, and the VFIO core code
>> requirement is that mdev driver device teardown must halt all
>> concurrent threads touching vfio_device. Thus the mdev will try to
>> fence its workqueue and then deadlock - unable to flush/cancel a work
>> that is currently blocked on a lock held by userspace that will never
>> be unlocked.
>>
>> This is just the first scenario that comes to mind. The approach of
>> giving userspace control of a lock that kernel threads can become
>> blocked on is so completely sketchy it is a complete no-go in my
>> opinion. If I had seen it when it was posted I would have hard NAK'd
>> it.
>>
>> My "full" solution in mind for iommufd is to pin all the memory upon
>> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue to satisfy DMA requests
>> while the mm_struct is not available. But IMHO this is basically
>> useless for any actual user of mdevs.
>>
>> The other option is to just exclude mdevs and fail
>> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
>> becoming present while it is asserted. In this way we don't need to do
>> anything beyond a simple check, as the iommu_domain is already fully
>> populated and pinned.
>
> Do we have a solution to this?
>
> If not, I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.
>
> Aside from the approach to use the FD, another idea is to just use
> fork.
>
> qemu would do something like
>
>   .. stop all container ioctl activity ..
>   fork()
>   ioctl(CHANGE_MM) // switch all maps to this mm
>   .. signal parent ..
>   .. wait parent ..
>   exit(0)
>   .. wait child ..
>   exec()
>   ioctl(CHANGE_MM) // switch all maps to this mm
>   .. signal child ..
>   waitpid(childpid)
>
> This way the kernel is never left without a page provider for the
> maps; the dummy mm_struct belonging to the fork will serve that role
> for the gap.
>
> And the above is only required if we have mdevs, so we could imagine
> userspace optimizing it away for, e.g., vfio-pci-only cases.
>
> It is not as efficient as using an FD backing, but this is super easy
> to implement in the kernel.
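
For concreteness, a rough userspace sketch of the fork-based sequence quoted
above might look like the following. It is an illustration only: VFIO_CHANGE_MM
is a placeholder for the proposed "switch all maps to this mm" ioctl (nothing
like it exists in the VFIO UAPI today), and the SIGUSR1 handshake is just one
way to implement the ".. signal .." / ".. wait .." steps.

/*
 * Illustrative only.  VFIO_CHANGE_MM stands in for the proposed
 * "switch all maps to this mm" ioctl; it is not part of the VFIO UAPI.
 * The container fd must stay open (no CLOEXEC) across the exec.
 */
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define VFIO_CHANGE_MM  0       /* hypothetical request number */

static void reexec_with_mm_handoff(int container_fd, char *const argv[])
{
        sigset_t set;
        int sig;

        /* Block SIGUSR1 in both processes so it can be used as a handshake. */
        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* .. stop all container ioctl activity .. */

        pid_t child = fork();
        if (child == 0) {
                /* Child: adopt the maps so the kernel always has an mm. */
                ioctl(container_fd, VFIO_CHANGE_MM);
                kill(getppid(), SIGUSR1);       /* .. signal parent .. */
                sigwait(&set, &sig);            /* .. wait parent .. */
                _exit(0);
        }

        sigwait(&set, &sig);                    /* .. wait child .. */
        execv(argv[0], argv);                   /* exec() */

        /*
         * The re-exec'd image (not reached here) would then do:
         *   ioctl(container_fd, VFIO_CHANGE_MM);  // take the maps back
         *   kill(child, SIGUSR1);                 // .. signal child ..
         *   waitpid(child, NULL, 0);
         */
}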
I propose to avoid deadlock for mediated devices as follows.
Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while
VFIO_DMA_UNMAP_FLAG_VADDR is asserted.

  * In vfio_wait, I will maintain a list of waiters, each list element
    consisting of (task, mdev, close_flag=false).

  * When the vfio device descriptor is closed, vfio_device_fops_release
    will notify the vfio_iommu driver, which will find the mdev on the
    waiters list, set elem->close_flag=true, and call wake_up_process
    for the task.

  * The task will wake in vfio_wait, see close_flag=true, and return
    EFAULT to the mdev caller.

This requires a little new plumbing. I will work out the details, but
if you see a problem with the overall approach, please let me know.

- Steve
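
P.S. For illustration only, here is a very rough sketch of the plumbing
described above. All of the names are invented for this sketch -- the
waiter struct, a vaddr_waiters list assumed to be added to struct
vfio_iommu, the vaddr_valid() helper, and the new close notification --
and none of it is meant as the final code.

/*
 * Rough sketch only; this would live in drivers/vfio/vfio_iommu_type1.c
 * and assumes a new vaddr_waiters list_head in struct vfio_iommu.
 */
struct vfio_vaddr_waiter {
        struct list_head        node;
        struct task_struct      *task;
        struct vfio_device      *mdev;  /* the vfio_pin_pages() caller */
        bool                    closed;
};

/*
 * Replaces the wait done in vfio_wait() while VFIO_DMA_UNMAP_FLAG_VADDR
 * is asserted.  Called and returns with iommu->lock held.
 */
static int vfio_wait_vaddr(struct vfio_iommu *iommu, struct vfio_device *mdev)
{
        struct vfio_vaddr_waiter w = {
                .task = current,
                .mdev = mdev,
        };

        list_add(&w.node, &iommu->vaddr_waiters);
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                /* vaddr_valid(): invented helper, "vaddrs reasserted" */
                if (w.closed || vaddr_valid(iommu))
                        break;
                mutex_unlock(&iommu->lock);
                schedule();
                mutex_lock(&iommu->lock);
        }
        __set_current_state(TASK_RUNNING);
        list_del(&w.node);

        /* close_flag set: tell the mdev caller the pin cannot complete */
        return w.closed ? -EFAULT : 0;
}

/*
 * Reached from vfio_device_fops_release() via a new notification into
 * the vfio_iommu driver, so a blocked pin never outlives the device fd.
 */
static void vfio_vaddr_notify_close(struct vfio_iommu *iommu,
                                    struct vfio_device *mdev)
{
        struct vfio_vaddr_waiter *w;

        mutex_lock(&iommu->lock);
        list_for_each_entry(w, &iommu->vaddr_waiters, node) {
                if (w->mdev == mdev) {
                        w->closed = true;
                        wake_up_process(w->task);
                }
        }
        mutex_unlock(&iommu->lock);
}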