Re: [PATCH v14 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Sun, 22 Mar 2020 21:10:41 -0400

On Sat, Mar 21, 2020 at 03:28:21AM +0800, Alex Williamson wrote:
> On Sat, 21 Mar 2020 00:44:32 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> 
> > On 3/20/2020 9:17 PM, Alex Williamson wrote:
> > > On Fri, 20 Mar 2020 09:40:39 -0600
> > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > >   
> > >> On Fri, 20 Mar 2020 04:35:29 -0400
> > >> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > >>  
> > >>> On Thu, Mar 19, 2020 at 03:41:12AM +0800, Kirti Wankhede wrote:  
> > >>>> DMA mapped pages, including those pinned by mdev vendor drivers, might
> > >>>> get unpinned and unmapped while migration is active and device is still
> > >>>> running. For example, in pre-copy phase while guest driver could access
> > >>>> those pages, host device or vendor driver can dirty these mapped pages.
> > >>>> Such pages should be marked dirty so as to maintain memory consistency
> > >>>> for a user making use of dirty page tracking.
> > >>>>
> > >>>> To get bitmap during unmap, user should set flag
> > >>>> VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP, bitmap memory should be allocated and
> > >>>> zeroed by user space application. Bitmap size and page size should be set
> > >>>> by user application.
> > >>>>
> > >>>> Signed-off-by: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> > >>>> Reviewed-by: Neo Jia <cjia@xxxxxxxxxx>
> > >>>> ---
> > >>>>   drivers/vfio/vfio_iommu_type1.c | 55 ++++++++++++++++++++++++++++++++++++++---
> > >>>>   include/uapi/linux/vfio.h       | 11 +++++++++
> > >>>>   2 files changed, 62 insertions(+), 4 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >>>> index d6417fb02174..aa1ac30f7854 100644
> > >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> > >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >>>> @@ -939,7 +939,8 @@ static int verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
> > >>>>   }
> > >>>>   
> > >>>>   static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>>> -			     struct vfio_iommu_type1_dma_unmap *unmap)
> > >>>> +			     struct vfio_iommu_type1_dma_unmap *unmap,
> > >>>> +			     struct vfio_bitmap *bitmap)
> > >>>>   {
> > >>>>   	uint64_t mask;
> > >>>>   	struct vfio_dma *dma, *dma_last = NULL;
> > >>>> @@ -990,6 +991,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>>>   	 * will be returned if these conditions are not met.  The v2 interface
> > >>>>   	 * will only return success and a size of zero if there were no
> > >>>>   	 * mappings within the range.
> > >>>> +	 *
> > >>>> +	 * When VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP flag is set, unmap request
> > >>>> +	 * must be for single mapping. Multiple mappings with this flag set is
> > >>>> +	 * not supported.
> > >>>>   	 */
> > >>>>   	if (iommu->v2) {
> > >>>>   		dma = vfio_find_dma(iommu, unmap->iova, 1);
> > >>>> @@ -997,6 +1002,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>>>   			ret = -EINVAL;
> > >>>>   			goto unlock;
> > >>>>   		}
> > >>>> +
> > >>>> +		if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> > >>>> +		    (dma->iova != unmap->iova || dma->size != unmap->size)) {  
> > >>> dma is probably NULL here!  
> > >>
> > >> Yep, I didn't look closely enough there.  This is situated right
> > >> between the check to make sure we're not bisecting a mapping at the
> > >> start of the unmap and the check to make sure we're not bisecting a
> > >> mapping at the end of the unmap.  There's no guarantee that we have a
> > >> valid pointer here.  The test should be in the while() loop below this
> > >> code.  
> > > 
> > > Actually the test could remain here, we can exit here if we can't find
> > > a dma at the start of the unmap range with the GET_DIRTY_BITMAP flag,
> > > but we absolutely cannot deref dma without testing it.
> > >   
> > 
> > In the check above newly added check, if dma is NULL then its an error 
> > condition, because Unmap requests must fully cover previous mappings, right?
> 
> Yes, but we'll do a null pointer deref before we return error.
>  
> > >>> And this restriction on UNMAP would make some UNMAP operations of vIOMMU
> > >>> fail.
> > >>>
> > >>> e.g. below condition indeed happens in reality.
> > >>> an UNMAP ioctl comes for IOVA range from 0xff800000, of size 0x200000
> > >>> However, IOVAs in this range are mapped page by page.i.e., dma->size is 0x1000.
> > >>>
> > >>> Previous, this UNMAP ioctl could unmap successfully as a whole.  
> > >>
> > >> What triggers this in the guest?  Note that it's only when using the
> > >> GET_DIRTY_BITMAP flag that this is restricted.  Does the event you're
> > >> referring to potentially occur under normal circumstances in that mode?
> > >> Thanks,
> > >>  

it happens in vIOMMU Domain level invalidation of IOTLB
(domain-selective invalidation, see vtd_iotlb_domain_invalidate() in qemu).
common in VTD lazy mode, and NOT just happening once at boot time.
rather than invalidate page by page, it batches the page invalidation.
so, when this invalidation takes place, even higher level page tables
have been invalid and therefore it has to invalidate a bigger combined range.
That's why we see IOVAs are mapped in 4k pages, but are unmapped in 2M
pages.

I think those UNMAPs should also have GET_DIRTY_BIMTAP flag on, right?
> > 
> > Such unmap would callback vfio_iommu_map_notify() in QEMU. In 
> > vfio_iommu_map_notify(), unmap is called on same range <iova, 
> > iotlb->addr_mask + 1> which was used for map. Secondly unmap with bitmap 
> > will be called only when device state has _SAVING flag set.
> 
in this case, iotlb->addr_mask in unmap is 0x200000 -1.
different than 0x1000 -1 used for map.
> It might be helpful for Yan, and everyone else, to see the latest QEMU
> patch series.  Thanks,
>
yes, please. also curious of log_sync part for vIOMMU. given most IOVAs in
address space are unmapped and therefore no IOTLBs are able to be found.

Thanks
Yan