On 1/9/2023 10:54 AM, Steven Sistare wrote:
> On 1/9/2023 8:52 AM, Jason Gunthorpe wrote:
>> On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
>>> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
>>>> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>>>>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>>>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>>>>> When a vfio container is preserved across exec, the task does not change,
>>>>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>>>>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
>>>>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>>>>> ENOMEM in __account_locked_vm.
>>>>>>>
>>>>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>>>>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
>>>>>>>
>>>>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>>>>
>>>>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>>>>>>> Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>>>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> index 144f5bb..71f980b 100644
>>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>>>  	struct task_struct	*task;
>>>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>>>  	unsigned long		*bitmap;
>>>>>>> +	struct mm_struct	*mm;
>>>>>>>  };
>>>>>>>
>>>>>>>  struct vfio_batch {
>>>>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>>  	if (!npage)
>>>>>>>  		return 0;
>>>>>>>
>>>>>>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>>>>> -	if (!mm)
>>>>>>> +	mm = dma->mm;
>>>>>>> +	if (async && !mmget_not_zero(mm))
>>>>>>>  		return -ESRCH;	/* process exited */
>>>>>>
>>>>>> Just delete the async, the lock_acct always acts on the dma which
>>>>>> always has a singular mm.
>>>>>>
>>>>>> Fix the few callers that need it to do the mmget_not_zero() before
>>>>>> calling in.
>>>>>
>>>>> Most of the callers pass async=true:
>>>>>     ret = vfio_lock_acct(dma, lock_acct, false);
>>>>>     vfio_lock_acct(dma, locked - unlocked, true);
>>>>>     ret = vfio_lock_acct(dma, 1, true);
>>>>>     vfio_lock_acct(dma, -unlocked, true);
>>>>>     vfio_lock_acct(dma, -1, true);
>>>>>     vfio_lock_acct(dma, -unlocked, true);
>>>>>     ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>>>>     mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>>>>     vfio_lock_acct(dma, locked - unlocked, true);
>>>>
>>>> Seems like if you make a lock_sub_acct() function that does the -1*
>>>> and does the mmget it will be OK?
>>>
>>> Do you mean, provide two versions of vfio_lock_acct?  Simplified:
>>>
>>> vfio_lock_acct()
>>> {
>>>     mm_lock_acct()
>>>     dma->locked_vm += npage;
>>> }
>>>
>>> vfio_lock_acct_async()
>>> {
>>>     mmget_not_zero(dma->mm)
>>>
>>>     mm_lock_acct()
>>>     dma->locked_vm += npage;
>>>
>>>     mmput(dma->mm);
>>> }
>>
>> I was thinking more like
>>
>> ()
>>   mmget_not_zero(dma->mm)
>>   mm->locked_vm -= npage
>    ^^^^^^
> Is this shorthand for open coding __account_locked_vm?  If so, we are
> essentially saying the same thing.  My function vfio_lock_acct_async calls
> mm_lock_acct which calls __account_locked_vm.
>
> But, your vfio_lock_acct_subtract does not call mmput, so maybe I still don't
> grok your suggestion.
>
> FWIW here are my functions with all error checking:
>
> static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
>                         bool lock_cap, long npage)
> {
>         int ret = mmap_write_lock_killable(mm);
>
>         if (!ret) {
>                 ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
>                                           lock_cap);
>                 mmap_write_unlock(mm);
>         }
>
>         return ret;
> }
>
> static int vfio_lock_acct(struct vfio_dma *dma, long npage)
> {
>         int ret;
>
>         if (!npage)
>                 return 0;
>
>         ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
>         if (!ret)
>                 dma->locked_vm += npage;
>
>         return ret;
> }
>
> static int vfio_lock_acct_async(struct vfio_dma *dma, long npage)
> {
>         int ret;
>
>         if (!npage)
>                 return 0;
>
>         if (!mmget_not_zero(dma->mm))
>                 return -ESRCH;  /* process exited */
>
>         ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
>         if (!ret)
>                 dma->locked_vm += npage;
>
>         mmput(dma->mm);
>
>         return ret;
> }

Let's leave the async arg and vfio_lock_acct as is.  We are over-polishing a
small style issue, in pre-existing code, in a soon-to-be dead-end code base.

- Steve