On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
>>
>>>> If Steve wants to keep it then someone needs to fix the deadlock in
>>>> the vfio implementation before any userspace starts to appear.
>>>
>>> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
>>> across exec, which can result in mm->locked_vm becoming negative. I have several
>>> fixes, but none result in limits being reached at exactly the same time as before --
>>> the same general issue being discussed for iommufd. I am still thinking about it.
>>
>> Oh, yeah, I noticed this was all busted up too.
>>
>>> I am not aware of a deadlock problem. Please elaborate or point me to an
>>> email thread.
>>
>> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
>> userspace can trigger the lock to be taken and then returns to
>> userspace with the lock held.
>>
>> Any scenario where a kernel thread hits that open-coded lock and then
>> userspace does-the-wrong-thing will deadlock the kernel.
>>
>> For instance, consider an mdev driver. We assert
>> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
>> and becomes blocked on the now locked lock. Userspace then tries to
>> close the device FD.
>>
>> FD closure will trigger device close, and the VFIO core code
>> requirement is that mdev driver device teardown must halt all
>> concurrent threads touching vfio_device. Thus the mdev will try to
>> fence its workqueue and then deadlock - unable to flush/cancel a work
>> that is currently blocked on a lock held by userspace that will never
>> be unlocked.
>>
>> This is just the first scenario that comes to mind. The approach of
>> giving userspace control of a lock that kernel threads can become
>> blocked on is so completely sketchy it is a complete no-go in my
>> opinion. If I had seen it when it was posted I would have hard NAK'd
>> it.
>>
>> My "full" solution in mind for iommufd is to pin all the memory upon
>> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue to satisfy DMA requests
>> while the mm_struct is not available. But IMHO this is basically
>> useless for any actual user of mdevs.
>>
>> The other option is to just exclude mdevs and fail
>> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
>> becoming present while it is asserted. In this way we don't need to do
>> anything beyond a simple check, as the iommu_domain is already fully
>> populated and pinned.
>
> Do we have a solution to this?
>
> If not, I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.
>
> Aside from the approach to use the FD, another idea is to just use
> fork.
>
> qemu would do something like
>
>   .. stop all container ioctl activity ..
>   fork()
>   ioctl(CHANGE_MM) // switch all maps to this mm
>   .. signal parent ..
>   .. wait parent ..
>   exit(0)
>   .. wait child ..
>   exec()
>   ioctl(CHANGE_MM) // switch all maps to this mm
>   .. signal child ..
>   waitpid(childpid)
>
> This way the kernel is never left without a page provider for the
> maps; the dummy mm_struct belonging to the fork will serve that role
> for the gap.
>
> And the above is only required if we have mdevs, so we could imagine
> userspace optimizing it away for, e.g., vfio-pci-only cases.
>
> It is not as efficient as using an FD backing, but this is super easy
> to implement in the kernel.
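
For concreteness, a rough userspace sketch of the fork-based sequence quoted
above might look like the following. It is an illustration only: VFIO_CHANGE_MM
is a placeholder for the proposed "switch all maps to this mm" ioctl (nothing
like it exists in the VFIO UAPI today), and the SIGUSR1 handshake is just one
way to implement the ".. signal .." / ".. wait .." steps.

/*
 * Illustrative only.  VFIO_CHANGE_MM stands in for the proposed
 * "switch all maps to this mm" ioctl; it is not part of the VFIO UAPI.
 * The container fd must stay open (no CLOEXEC) across the exec.
 */
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define VFIO_CHANGE_MM  0       /* hypothetical request number */

static void reexec_with_mm_handoff(int container_fd, char *const argv[])
{
        sigset_t set;
        int sig;

        /* Block SIGUSR1 in both processes so it can be used as a handshake. */
        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* .. stop all container ioctl activity .. */

        pid_t child = fork();
        if (child == 0) {
                /* Child: adopt the maps so the kernel always has an mm. */
                ioctl(container_fd, VFIO_CHANGE_MM);
                kill(getppid(), SIGUSR1);       /* .. signal parent .. */
                sigwait(&set, &sig);            /* .. wait parent .. */
                _exit(0);
        }

        sigwait(&set, &sig);                    /* .. wait child .. */
        execv(argv[0], argv);                   /* exec() */

        /*
         * The re-exec'd image (not reached here) would then do:
         *   ioctl(container_fd, VFIO_CHANGE_MM);  // take the maps back
         *   kill(child, SIGUSR1);                 // .. signal child ..
         *   waitpid(child, NULL, 0);
         */
}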
I propose to avoid deadlock for mediated devices as follows.
Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while
VFIO_DMA_UNMAP_FLAG_VADDR is asserted.

  * In vfio_wait, I will maintain a list of waiters, each list element
    consisting of (task, mdev, close_flag=false).

  * When the vfio device descriptor is closed, vfio_device_fops_release
    will notify the vfio_iommu driver, which will find the mdev on the
    waiters list, set elem->close_flag=true, and call wake_up_process
    for the task.

  * The task will wake in vfio_wait, see close_flag=true, and return
    EFAULT to the mdev caller.

This requires a little new plumbing. I will work out the details, but
if you see a problem with the overall approach, please let me know.

- Steve
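
P.S. For illustration only, here is a very rough sketch of the plumbing
described above. All of the names are invented for this sketch -- the
waiter struct, a vaddr_waiters list assumed to be added to struct
vfio_iommu, the vaddr_valid() helper, and the new close notification --
and none of it is meant as the final code.

/*
 * Rough sketch only; this would live in drivers/vfio/vfio_iommu_type1.c
 * and assumes a new vaddr_waiters list_head in struct vfio_iommu.
 */
struct vfio_vaddr_waiter {
        struct list_head        node;
        struct task_struct      *task;
        struct vfio_device      *mdev;  /* the vfio_pin_pages() caller */
        bool                    closed;
};

/*
 * Replaces the wait done in vfio_wait() while VFIO_DMA_UNMAP_FLAG_VADDR
 * is asserted.  Called and returns with iommu->lock held.
 */
static int vfio_wait_vaddr(struct vfio_iommu *iommu, struct vfio_device *mdev)
{
        struct vfio_vaddr_waiter w = {
                .task = current,
                .mdev = mdev,
        };

        list_add(&w.node, &iommu->vaddr_waiters);
        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                /* vaddr_valid(): invented helper, "vaddrs reasserted" */
                if (w.closed || vaddr_valid(iommu))
                        break;
                mutex_unlock(&iommu->lock);
                schedule();
                mutex_lock(&iommu->lock);
        }
        __set_current_state(TASK_RUNNING);
        list_del(&w.node);

        /* close_flag set: tell the mdev caller the pin cannot complete */
        return w.closed ? -EFAULT : 0;
}

/*
 * Reached from vfio_device_fops_release() via a new notification into
 * the vfio_iommu driver, so a blocked pin never outlives the device fd.
 */
static void vfio_vaddr_notify_close(struct vfio_iommu *iommu,
                                    struct vfio_device *mdev)
{
        struct vfio_vaddr_waiter *w;

        mutex_lock(&iommu->lock);
        list_for_each_entry(w, &iommu->vaddr_waiters, node) {
                if (w->mdev == mdev) {
                        w->closed = true;
                        wake_up_process(w->task);
                }
        }
        mutex_unlock(&iommu->lock);
}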