Re: [PATCH V6 2/7] vfio/type1: prevent underflow of locked_vm via exec()

Steven Sistare <steven.sistare@xxxxxxxxxx> · Tue, 20 Dec 2022 10:01:21 -0500

On 12/19/2022 2:48 AM, Tian, Kevin wrote:
>> From: Steve Sistare <steven.sistare@xxxxxxxxxx>
>> Sent: Saturday, December 17, 2022 2:51 AM
>>
>> When a vfio container is preserved across exec, the task does not change,
>> but it gets a new mm with locked_vm=0.  If the user later unmaps a dma
>> mapping, locked_vm underflows to a large unsigned value, and a subsequent
>> dma map request fails with ENOMEM in __account_locked_vm.
>>
>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
> 
> worth clarifying that locked_vm of the new mm is still not fixed.

Will do.

>> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu
>> *iommu,
>>  	 * against the locked memory limit and we need to be able to do both
>>  	 * outside of this call path as pinning can be asynchronous via the
>>  	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires
>> a
>> -	 * task_struct and VM locked pages requires an mm_struct, however
>> -	 * holding an indefinite mm reference is not recommended,
>> therefore we
>> -	 * only hold a reference to a task.  We could hold a reference to
>> -	 * current, however QEMU uses this call path through vCPU threads,
>> -	 * which can be killed resulting in a NULL mm and failure in the
>> unmap
>> -	 * path when called via a different thread.  Avoid this problem by
>> -	 * using the group_leader as threads within the same group require
>> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the
>> same
>> -	 * mm_struct.
>> +	 * task_struct and VM locked pages requires an mm_struct.
> 
> IMHO the rationale why choosing group_leader still applies...

I don't see why it still applies.  With the new code, we may save a reference 
to current or current->group_leader, without error.  "NULL mm and failure in the 
unmap path" will not happen with mmgrab.  task->signal->rlimit is shared, so it 
does not matter which task we use, or whether the task is dead, as long as 
one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.  Am
I missing something?

I kept current->group_leader for ease of debugging, so that all dma's are owned 
by the same task.

- Steve

> otherwise this looks good to me:
> 
> Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>