Re: [PATCH V1 4/5] vfio: VA suspend interface

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/12/2021 11:10 PM, Alex Williamson wrote:
> On Tue, 12 Jan 2021 15:47:56 -0700
> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> 
>> On Mon, 11 Jan 2021 16:15:02 -0500
>> Steven Sistare <steven.sistare@xxxxxxxxxx> wrote:
>>
>>> On 1/8/2021 4:15 PM, Alex Williamson wrote:  
>>>> On Tue,  5 Jan 2021 07:36:52 -0800
>>>> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote:
>>>>     
>>>>> Add interfaces that allow the underlying memory object of an iova
>>>>> range to be mapped to a new host virtual address in the host process:
>>>>>
>>>>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>>>>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>>>>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION    
>>>>
>>>> Suspend and Resume can imply many things other than what's done here.
>>>> Should these be something more akin to INVALIDATE_VADDR and
>>>> REPLACE_VADDR?    
>>>
>>> Agreed.  I suspected we would discuss the names.  Some possibilities:
>>>
>>> INVALIDATE_VADDR  REPLACE_VADDR
>>> INV_VADDR         SET_VADDR
>>> CLEAR_VADDR       SET_VADDR
>>> SUSPEND_VADDR     RESUME_VADDR
>>>   
>>>>> The suspend interface blocks vfio translation of host virtual
>>>>> addresses in a range, but DMA to already-mapped pages continues.
>>>>> The resume interface records the new base VA and resumes translation.
>>>>> See comments in uapi/linux/vfio.h for more details.
>>>>>
>>>>> This is a partial implementation.  Blocking is added in the next patch.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>>>>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>>>>>  2 files changed, 57 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 3dc501d..2c164a6 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>>>>  	int			prot;		/* IOMMU_READ/WRITE */
>>>>>  	bool			iommu_mapped;
>>>>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>>>> +	bool			suspended;    
>>>>
>>>> Is there a value we could use for vfio_dma.vaddr that would always be
>>>> considered invalid, ex. ULONG_MAX?      
>>>
>>> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
>>> to invalidate, will probably yield equally understandable code.  I'll try it.  
>>
>> Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
>> as part of the uapi, could we implement this with only a single flag on
>> the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
>> to set the vaddr, first to the invalid valid, then to a new value.  It's
>> always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
>> when the mapping is not actually unmapped.  That might lean towards an
>> UPDATE or REPLACE flag.
> 
> I realized you really want to make use of the DMA_UNMAP ioctl in order
> to use ranges, maybe we can make the mental model more coherent with an
> unmap flag like VFIO_DMA_UNMAP_FLAG_VADDR_ONLY, ie. we're only asking
> to unmap the vaddr.  The DMA_MAP side might take a similar VADDR_ONLY
> flag to reset the vaddr.  That also retains your desired semantics that
> we can't "resume" a vaddr that wasn't previously "suspended", we can't
> map a vaddr that wasn't previously unmapped.

I like this.

> For the unmap-all problem, userspace already needs to work around this,
> see for instance QEMU:
> 
> 1b296c3def4b vfio: Don't issue full 2^64 unmap

Yes, I saw that.  And one cannot split the range in two at some arbitrary 
mmu-page-size boundary, least you split a mapping, which V2 does not allow.

> So I wonder really how critical it is and whether it really would be
> sufficient for userspace to track a high water mark for mappings.

If the high water mark is 2^64, then you also need a low water mark to 
enable a single call, else one call cannot span the range.  And low water
better not be 0.  And I would rather not add code to compute either one :)

> Otherwise, I think I'm leaning towards a DMA_UNMAP flag like
> VFIO_DMA_UNMAP_FLAG_UNMAP_ALL that would disregard the iova and size
> fields to apply to all mappings.  Designating special values for iova or
> size trigger extended behavior feels a bit hackish.  Thanks,

The flag sounds best to me.

- Steve

>>>> We'd need to decide if we want to
>>>> allow users to create mappings (mdev-only) using an initial invalid
>>>> vaddr.    
>>>
>>> Maybe.  Not sure yet.  
>>
>> If we used the above, it almost seems strange not to allow it, but at
>> the same time we don't really want to have different rules for
>> different devices types.  An initially valid vaddr doesn't seem
>> unreasonable... though we don't test it until the vendor driver tries
>> to pin or rw pages w/o IOMMU backing.
>>  
>>>>>  	struct task_struct	*task;
>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>  	unsigned long		*bitmap;
>>>>> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>>  	int ret = 0, retries = 0;
>>>>>  	unsigned long pgshift;
>>>>>  	dma_addr_t iova;
>>>>> -	unsigned long size;
>>>>> +	unsigned long size, consumed;    
>>>>
>>>> This could be scoped into the branch below.    
>>>
>>> OK.
>>>   
>>>>>  	mutex_lock(&iommu->lock);
>>>>>  
>>>>> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>>  		if (dma->task->mm != current->mm)
>>>>>  			break;
>>>>>  
>>>>> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
>>>>> +			if (dma->suspended) {
>>>>> +				ret = -EINVAL;
>>>>> +				goto unlock;
>>>>> +			}    
>>>>
>>>> This leaves us in a state where we marked some entries but not others.
>>>> We should either unwind or... what's the actual harm in skipping these?    
>>>
>>> We could skip them with no ill effect.  However, it likely means the app is confused
>>> and potentially broken, and it would be courteous to inform them so.  I found such bugs
>>> in qemu as I was developing this feature.
>>>
>>> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
>>> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
>>> developer to fix the redundant suspend call.  
>>
>> That sounds like an entirely practical rationalization, but our
>> standard practice is to maintain a consistent state.  If an ioctl fails
>> is should effectively be as if the ioctl was never called, where
>> possible.  Userspace can be broken, and potentially so broken that their
>> best choice is to abort, but we should maintain consistent, predictable
>> behavior.
>>
>>>>> +			dma->suspended = true;
>>>>> +			consumed = dma->iova + dma->size - iova;
>>>>> +			if (consumed >= size)
>>>>> +				break;
>>>>> +			iova += consumed;
>>>>> +			size -= consumed;
>>>>> +			unmapped += dma->size;
>>>>> +			continue;
>>>>> +		}    
>>>>
>>>> This short-cuts the dirty bitmap flag, so we need to decide if it's
>>>> legal to call them together or we need to prevent it... Oh, I see
>>>> you've excluded them earlier below.
>>>>     
>>>>> +
>>>>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>>>>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>>>>>  
>>>>> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>>>>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  			   struct vfio_iommu_type1_dma_map *map)
>>>>>  {
>>>>> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
>>>>>  	dma_addr_t iova = map->iova;
>>>>>  	unsigned long vaddr = map->vaddr;
>>>>>  	size_t size = map->size;
>>>>> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>>>>  		prot |= IOMMU_READ;
>>>>>  
>>>>> +	if ((prot && resume) || (!prot && !resume))
>>>>> +		return -EINVAL;
>>>>> +
>>>>>  	mutex_lock(&iommu->lock);
>>>>>  
>>>>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>>>>>  
>>>>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>>>>>  
>>>>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
>>>>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>>>>>  		ret = -EINVAL;
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>>  
>>>>> -	if (vfio_find_dma(iommu, iova, size)) {
>>>>> +	dma = vfio_find_dma(iommu, iova, size);
>>>>> +	if (resume) {
>>>>> +		if (!dma) {
>>>>> +			ret = -ENOENT;
>>>>> +		} else if (!dma->suspended || dma->iova != iova ||
>>>>> +			   dma->size != size) {    
>>>>
>>>> Why is it necessary that the vfio_dma be suspended before being
>>>> resumed?  Couldn't a user simply use this to change the vaddr?  Does
>>>> that promote abusive use?    
>>>
>>> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
>>> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
>>> This could only be safe if the memory object is mapped at both the old vaddr and the new
>>> vaddr concurrently, which is an unlikely use case.  
>>
>> Ok, it's not like the use can't make it instantaneously invalid and then
>> replace it.
>>
>>>>> +			ret = -EINVAL;
>>>>> +		} else {
>>>>> +			dma->vaddr = vaddr;    
>>>>
>>>> Seems like there's a huge opportunity for a user to create coherency
>>>> issues here... it's their data though I guess.    
>>>
>>> Yes.  That's what the language in the uapi about mapping the same memory object is about.
>>>   
>>>>> +			dma->suspended = false;
>>>>> +		}
>>>>> +		goto out_unlock;
>>>>> +	} else if (dma) {
>>>>>  		ret = -EEXIST;
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>>>>  	case VFIO_TYPE1_IOMMU:
>>>>>  	case VFIO_TYPE1v2_IOMMU:
>>>>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>>>> +	case VFIO_SUSPEND:
>>>>>  		return 1;
>>>>>  	case VFIO_DMA_CC_IOMMU:
>>>>>  		if (!iommu)
>>>>> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>>>>>  {
>>>>>  	struct vfio_iommu_type1_dma_map map;
>>>>>  	unsigned long minsz;
>>>>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>>>>> +			VFIO_DMA_MAP_FLAG_RESUME;
>>>>>  
>>>>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>>>>  
>>>>> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>>  {
>>>>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>>>>  	struct vfio_bitmap bitmap = { 0 };
>>>>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>>>>> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
>>>>>  	unsigned long minsz;
>>>>>  	int ret;
>>>>>  
>>>>> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>>>>  		return -EFAULT;
>>>>>  
>>>>> -	if (unmap.argsz < minsz ||
>>>>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>>>>> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)    
>>>>
>>>> Maybe a short comment here to note that dirty-bimap and
>>>> suspend/invalidate are mutually exclusive.  Probably should be
>>>> mentioned in the uapi too.    
>>>
>>> Will do, for both.
>>>   
>>>>>  		return -EINVAL;
>>>>>  
>>>>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 896e527..fcf7b56 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -46,6 +46,9 @@
>>>>>   */
>>>>>  #define VFIO_NOIOMMU_IOMMU		8
>>>>>  
>>>>> +/* Supports VFIO DMA suspend and resume */
>>>>> +#define VFIO_SUSPEND			9
>>>>> +
>>>>>  /*
>>>>>   * The IOCTL interface is designed for extensibility by embedding the
>>>>>   * structure length (argsz) and flags into structures passed between
>>>>> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
>>>>>   *
>>>>>   * Map process virtual addresses to IO virtual addresses using the
>>>>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>>>>> + *
>>>>> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
>>>>> + * resume translation of host virtual addresses in the iova range.  The new
>>>>> + * vaddr must point to the same memory object as the old vaddr, but this is not
>>>>> + * verified.    
>>>>
>>>> It's hard to use "must" terminology here if we're not going to check.
>>>> Maybe the phrasing should be something more along the lines of "should
>>>> point to the same memory object or the user risks coherency issues
>>>> within their virtual address space".    
>>>
>>> I used "must" because it is always incorrect if the object is not the same.  How about:
>>>   The new vaddr must point to the same memory object as the old vaddr, but this is not
>>>   verified.  Violation of this constraint may result in memory corruption within the
>>>   host process and/or guest.  
>>
>> Since the "must" is not relative to the API but to the resulting
>> behavior, perhaps something like:
>>
>>   In order to maintain memory consistency within the user application,
>>   the updated vaddr must address the same memory object as originally
>>   mapped, failure to do so will result in user memory corruption and/or
>>   device misbehavior.
>>
>> Thanks,
>> Alex
>>
>>>>>  iova and size must match those in the original MAP_DMA call.
>>>>> + * Protection is not changed, and the READ & WRITE flags must be 0.    
>>>>
>>>> This doesn't mention that the entry must be previously
>>>> suspended/invalidated (if we choose to keep those semantics).  Thanks,    
>>>
>>> Will add, thanks.
>>>
>>> - Steve   
>>>>>   */
>>>>>  struct vfio_iommu_type1_dma_map {
>>>>>  	__u32	argsz;
>>>>>  	__u32	flags;
>>>>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>>>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>>>>> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
>>>>>  	__u64	vaddr;				/* Process virtual address */
>>>>>  	__u64	iova;				/* IO virtual address */
>>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>>> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
>>>>>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>>>>>   * pages in the range of unmapped size is returned in the user-provided
>>>>>   * vfio_bitmap.data.
>>>>> + *
>>>>> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
>>>>> + * translation of host virtual addresses in the iova range.  During suspension,
>>>>> + * kernel threads that attempt to translate will block.  DMA to already-mapped
>>>>> + * pages continues.
>>>>>   */
>>>>>  struct vfio_iommu_type1_dma_unmap {
>>>>>  	__u32	argsz;
>>>>>  	__u32	flags;
>>>>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>>>> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
>>>>>  	__u64	iova;				/* IO virtual address */
>>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>>>  	__u8    data[];    
>>>>     
>>>   
>>
> 



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux