Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow

Steven Sistare <steven.sistare@xxxxxxxxxx> · Tue, 13 Dec 2022 14:40:24 -0500

On 12/13/2022 2:29 PM, Alex Williamson wrote:
> On Tue, 13 Dec 2022 13:21:15 -0500
> Steven Sistare <steven.sistare@xxxxxxxxxx> wrote:
> 
>> On 12/13/2022 1:17 PM, Steven Sistare wrote:
>>> On 12/13/2022 1:02 PM, Alex Williamson wrote:  
>>>> On Tue, 13 Dec 2022 07:46:56 -0800
>>>> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote:
>>>>  
>>>>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
>>>>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>>>>> dma mapping, locked_vm underflows to a large unsigned value, and a
>>>>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
>>>>>
>>>>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
>>>>> add the mapping's pinned page count to the new mm->locked_vm, subject to
>>>>> the rlimit.  Now that mediated devices are excluded when using
>>>>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
>>>>> mapping.
>>>>>
>>>>> Underflow will not occur when all dma mappings are invalidated before exec.
>>>>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
>>>>> will fail with EINVAL because the mapping is in the vaddr_invalid state.  
>>>>
>>>> Where is this enforced?  
>>>
>>> In vfio_dma_do_unmap:
>>>         if (invalidate_vaddr) {
>>>                 if (dma->vaddr_invalid) {
>>>                         ...
>>>                         ret = -EINVAL;  
>>
>> My bad, this is a different case, and my comment in the commit message is
>> incorrect.  I should test mm != dma->mm during unmap as well, and suppress
>> the locked_vm deduction there.
> 
> I'm getting confused how this patch actually does anything.  We grab
> the mm of the task doing mappings, and we swap that grab when updating
> the vaddr, but vfio_lock_acct() uses the original dma->task mm for
> accounting.  Therefore how can an underflow occur?  It seems we're
> simply failing to adjust locked_vm for the new mm at all.

The old code saves dma->task, but not dma->task->mm.  The task's mm changes 
across exec.

>>>>> Underflow may still occur in a buggy application that fails to invalidate
>>>>> all before exec.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>>>>>  1 file changed, 11 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index f81e925..e5a02f8 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>  	struct task_struct	*task;
>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>  	unsigned long		*bitmap;
>>>>> +	struct mm_struct	*mm;
>>>>>  };
>>>>>  
>>>>>  struct vfio_batch {
>>>>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>>>  	vfio_unmap_unpin(iommu, dma, true);
>>>>>  	vfio_unlink_dma(iommu, dma);
>>>>>  	put_task_struct(dma->task);
>>>>> +	mmdrop(dma->mm);
>>>>>  	vfio_dma_bitmap_free(dma);
>>>>>  	if (dma->vaddr_invalid) {
>>>>>  		iommu->vaddr_invalid_count--;
>>>>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  			dma->vaddr = vaddr;
>>>>>  			dma->vaddr_invalid = false;
>>>>>  			iommu->vaddr_invalid_count--;
>>>>> +			if (current->mm != dma->mm) {
>>>>> +				mmdrop(dma->mm);
>>>>> +				dma->mm = current->mm;
>>>>> +				mmgrab(dma->mm);
>>>>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
>>>>> +						     0);  
>>>>
>>>> What does it actually mean if this fails?  The pages are still pinned.
>>>> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,  
>>>
>>> If this fails, the user has locked additional memory after exec and before making
>>> this call -- more than was locked before exec -- and the rlimit is exceeded.
>>> A misbehaving application, which will only hurt itself.
>>>
>>> However, I should reorder these, and check ret before changing the other state.
> 
> The result would then be that the mapping remains with vaddr_invalid on
> error?  Thanks,

Correct.  In theory the app could recover by releasing the extra locked memory that
it grabbed, or increase its rlimit, and then try map_flag_vaddr again.

- Steve

>>>>> +			}
>>>>>  			wake_up_all(&iommu->vaddr_wait);
>>>>>  		}
>>>>>  		goto out_unlock;
>>>>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  	get_task_struct(current->group_leader);
>>>>>  	dma->task = current->group_leader;
>>>>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
>>>>> +	dma->mm = dma->task->mm;
>>>>> +	mmgrab(dma->mm);
>>>>>  
>>>>>  	dma->pfn_list = RB_ROOT;
>>>>>    
>>>>  
>>
>