Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.

Kirti Wankhede <kwankhede@xxxxxxxxxx> · Thu, 9 Jan 2020 01:31:16 +0530






On 1/8/2020 3:32 AM, Alex Williamson wrote:
On Wed, 8 Jan 2020 01:37:03 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:


<snip>

+
+	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
   
   	if (do_accounting)
   		vfio_lock_acct(dma, -unlocked, true);
@@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
   
   		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
   		if (vpfn) {
-			phys_pfn[i] = vpfn->pfn;
-			continue;
+			if (vpfn->unpinned)
+				vfio_remove_from_pfn_list(dma, vpfn);

This seems inefficient, we have an allocated vpfn at the right places
in the list, wouldn't it be better to repin the page?
   

vfio_pin_page_external() takes care of pinning and accounting as well.

Yes, but could we call vfio_pin_page_external() without {unlinking,
freeing} and {re-allocating, linking} on either side of it since it's
already in the list?  That's the inefficient part.  Maybe at least a
TODO comment?


Changing it as below:

                vpfn = vfio_iova_get_vfio_pfn(dma, iova);
                if (vpfn) {
-                       phys_pfn[i] = vpfn->pfn;
-                       continue;
+                       if (vpfn->ref_count > 1) {
+                               phys_pfn[i] = vpfn->pfn;
+                               continue;
+                       }
                }

                remote_vaddr = dma->vaddr + iova - dma->iova;
                ret = vfio_pin_page_external(dma, remote_vaddr, 
&phys_pfn[i],
                                             do_accounting);
                if (ret)
                        goto pin_unwind;
-
-               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
-               if (ret) {
-                       vfio_unpin_page_external(dma, iova, do_accounting);
-                       goto pin_unwind;
-               }
+               if (!vpfn) {
+                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
+                       if (ret) {
+                               vfio_unpin_page_external(dma, iova,
+                                                        do_accounting, 
false);
+                               goto pin_unwind;
+                       }
+               } else
+                       vpfn->pfn = phys_pfn[i];
        }




+			else {
+				phys_pfn[i] = vpfn->pfn;
+				continue;
+			}
   		}
   
   		remote_vaddr = dma->vaddr + iova - dma->iova;
@@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
   
   		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
   		if (ret) {
-			vfio_unpin_page_external(dma, iova, do_accounting);
+			vfio_unpin_page_external(dma, iova, do_accounting,
+						 false);
   			goto pin_unwind;
   		}
   	}

<snip>


+		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+			iommu->dirty_page_tracking = true;
+			return 0;
+		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+			iommu->dirty_page_tracking = false;
+
+			mutex_lock(&iommu->lock);
+			vfio_remove_unpinned_from_dma_list(iommu);
+			mutex_unlock(&iommu->lock);
+			return 0;
+
+		} else if (range.flags &
+				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+			uint64_t iommu_pgmask;
+			unsigned long pgshift = __ffs(range.pgsize);
+			unsigned long *bitmap;
+			long bsize;
+
+			iommu_pgmask =
+			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
+
+			if (((range.pgsize - 1) & iommu_pgmask) !=
+			    (range.pgsize - 1))
+				return -EINVAL;
+
+			if (range.iova & iommu_pgmask)
+				return -EINVAL;
+			if (!range.size || range.size > SIZE_MAX)
+				return -EINVAL;
+			if (range.iova + range.size < range.iova)
+				return -EINVAL;
+
+			bsize = verify_bitmap_size(range.size >> pgshift,
+						   range.bitmap_size);
+			if (bsize < 0)
+				return ret;
+
+			bitmap = kmalloc(bsize, GFP_KERNEL);

I think I remember mentioning previously that we cannot allocate an
arbitrary buffer on behalf of the user, it's far too easy for them to
kill the kernel that way.  kmalloc is also limited in what it can
alloc.

That's the reason added verify_bitmap_size(), so that size is verified

That's only a consistency test, it only verifies that the user claims
to provide a bitmap sized sufficiently for the range they're trying to
request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
Meanwhile the user doesn't actually need to allocate that bitmap in
order to crash the kernel.

Can't we use the user buffer directly or only work on a part of
it at a time?
   

without copy_from_user(), how?

For starters, what's the benefit of copying the bitmap from the user
in the first place?  We presume the data is zero'd and if it's not,
that's the user's bug to sort out (we just need to define the API to
specify that).  Therefore the copy_from_user() is unnecessary anyway and
we really just need to copy_to_user() for any places we're setting
bits.  We could just walk through the range with an unsigned long
bitmap buffer, writing it out to userspace any time we reach the end
with bits set, zeroing it and shifting it as a window to the user
buffer.  We could improve batching by allocating a larger buffer in the
kernel, with a kernel defined maximum size and performing the same
windowing scheme.


Ok removing copy_from_user().
But AFAIK, calling copy_to_user() multiple times is not efficient in 
terms of performance.

Checked code in virt/kvm/kvm_main.c: __kvm_set_memory_region() where 
dirty_bitmap is allocated, that has generic checks, user space address 
check, memory overflow check and KVM_MEM_MAX_NR_PAGES as below. I'll add 
access_ok check. I already added overflow check.

        /* General sanity checks */
        if (mem->memory_size & (PAGE_SIZE - 1))
                goto out;

       !access_ok((void __user *)(unsigned long)mem->userspace_addr,
                        mem->memory_size)))

        if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
                goto out;

        if (npages > KVM_MEM_MAX_NR_PAGES)
                goto out;


Where KVM_MEM_MAX_NR_PAGES is:

/*
 * Some of the bitops functions do not support too long bitmaps.
 * This number must be determined not to exceed such limits.
 */
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

But we can't use KVM specific KVM_MEM_MAX_NR_PAGES check in vfio module.
Should we define similar limit in vfio module instead of SIZE_MAX?

I don't know if there's a way to directly map the user buffer rather
than copy_to_user(), but I thought I'd ask in case there's some obvious
efficiency improvement to be had there.

+			if (!bitmap)
+				return -ENOMEM;
+
+			ret = copy_from_user(bitmap,
+			     (void __user *)range.bitmap, bsize) ? -EFAULT : 0;
+			if (ret)
+				goto bitmap_exit;
+
+			iommu->dirty_page_tracking = false;

a) This is done outside of the mutex and susceptible to races,

moving inside lock

b) why is this done?
once bitmap is read, dirty_page_tracking should be stopped. Right?

Absolutely not.  Dirty bit page tracking should remain enabled until
the user turns it off, doing otherwise precludes both iterative and
partial dirty page collection from userspace.  I think both of those
are fundamentally required of this interface.  Thanks,


OK.

Thanks,
Kirti