On Thu, 9 Jan 2020 18:59:40 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 1/9/2020 3:59 AM, Alex Williamson wrote:
> > On Thu, 9 Jan 2020 01:31:16 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >
> >> On 1/8/2020 3:32 AM, Alex Williamson wrote:
> >>> On Wed, 8 Jan 2020 01:37:03 +0530
> >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >>>
> >>
> >> <snip>
> >>
> >>>>>> +
> >>>>>> +        unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>>>
> >>>>>>          if (do_accounting)
> >>>>>>                  vfio_lock_acct(dma, -unlocked, true);
> >>>>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>
> >>>>>>                  vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>>>>                  if (vpfn) {
> >>>>>> -                        phys_pfn[i] = vpfn->pfn;
> >>>>>> -                        continue;
> >>>>>> +                        if (vpfn->unpinned)
> >>>>>> +                                vfio_remove_from_pfn_list(dma, vpfn);
> >>>>>
> >>>>> This seems inefficient, we have an allocated vpfn at the right places
> >>>>> in the list, wouldn't it be better to repin the page?
> >>>>>
> >>>>
> >>>> vfio_pin_page_external() takes care of pinning and accounting as well.
> >>>
> >>> Yes, but could we call vfio_pin_page_external() without {unlinking,
> >>> freeing} and {re-allocating, linking} on either side of it since it's
> >>> already in the list?  That's the inefficient part.  Maybe at least a
> >>> TODO comment?
> >>>
> >>
> >> Changing it as below:
> >>
> >>          vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>          if (vpfn) {
> >> -                phys_pfn[i] = vpfn->pfn;
> >> -                continue;
> >> +                if (vpfn->ref_count > 1) {
> >> +                        phys_pfn[i] = vpfn->pfn;
> >> +                        continue;
> >> +                }
> >>          }
> >>
> >>          remote_vaddr = dma->vaddr + iova - dma->iova;
> >>          ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
> >>                                       do_accounting);
> >>          if (ret)
> >>                  goto pin_unwind;
> >> -
> >> -        ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> -        if (ret) {
> >> -                vfio_unpin_page_external(dma, iova, do_accounting);
> >> -                goto pin_unwind;
> >> -        }
> >> +        if (!vpfn) {
> >> +                ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> +                if (ret) {
> >> +                        vfio_unpin_page_external(dma, iova,
> >> +                                                 do_accounting, false);
> >> +                        goto pin_unwind;
> >> +                }
> >> +        } else
> >> +                vpfn->pfn = phys_pfn[i];
> >>  }
> >>
> >>
> >>>>>> +                        else {
> >>>>>> +                                phys_pfn[i] = vpfn->pfn;
> >>>>>> +                                continue;
> >>>>>> +                        }
> >>>>>>                  }
> >>>>>>
> >>>>>>                  remote_vaddr = dma->vaddr + iova - dma->iova;
> >>>>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>
> >>>>>>                  ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>>>                  if (ret) {
> >>>>>> -                        vfio_unpin_page_external(dma, iova, do_accounting);
> >>>>>> +                        vfio_unpin_page_external(dma, iova, do_accounting,
> >>>>>> +                                                 false);
> >>>>>>                          goto pin_unwind;
> >>>>>>                  }
> >>>>>>          }
> >>
> >> <snip>
> >>
> >>>>
> >>>>>> +        if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>>>> +                iommu->dirty_page_tracking = true;
> >>>>>> +                return 0;
> >>>>>> +        } else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>>>> +                iommu->dirty_page_tracking = false;
> >>>>>> +
> >>>>>> +                mutex_lock(&iommu->lock);
> >>>>>> +                vfio_remove_unpinned_from_dma_list(iommu);
> >>>>>> +                mutex_unlock(&iommu->lock);
> >>>>>> +                return 0;
> >>>>>> +
> >>>>>> +        } else if (range.flags &
> >>>>>> +                        VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>>>> +                uint64_t iommu_pgmask;
> >>>>>> +                unsigned long pgshift = __ffs(range.pgsize);
> >>>>>> +                unsigned long *bitmap;
> >>>>>> +                long bsize;
> >>>>>> +
> >>>>>> +                iommu_pgmask =
> >>>>>> +                        ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>>>> +
> >>>>>> +                if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>>>> +                    (range.pgsize - 1))
> >>>>>> +                        return -EINVAL;
> >>>>>> +
> >>>>>> +                if (range.iova & iommu_pgmask)
> >>>>>> +                        return -EINVAL;
> >>>>>> +                if (!range.size || range.size > SIZE_MAX)
> >>>>>> +                        return -EINVAL;
> >>>>>> +                if (range.iova + range.size < range.iova)
> >>>>>> +                        return -EINVAL;
> >>>>>> +
> >>>>>> +                bsize = verify_bitmap_size(range.size >> pgshift,
> >>>>>> +                                           range.bitmap_size);
> >>>>>> +                if (bsize < 0)
> >>>>>> +                        return ret;
> >>>>>> +
> >>>>>> +                bitmap = kmalloc(bsize, GFP_KERNEL);
> >>>>>
> >>>>> I think I remember mentioning previously that we cannot allocate an
> >>>>> arbitrary buffer on behalf of the user, it's far too easy for them to
> >>>>> kill the kernel that way.  kmalloc is also limited in what it can
> >>>>> alloc.
> >>>>
> >>>> That's the reason I added verify_bitmap_size(), so that the size is
> >>>> verified.
> >>>
> >>> That's only a consistency test, it only verifies that the user claims
> >>> to provide a bitmap sized sufficiently for the range they're trying to
> >>> request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
> >>> size for 2^52 bits, 8 bits per byte for 2^49 bytes of bitmap that we'd
> >>> try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
> >>> Meanwhile the user doesn't actually need to allocate that bitmap in
> >>> order to crash the kernel.
> >>>
> >>>>> Can't we use the user buffer directly or only work on a part of
> >>>>> it at a time?
> >>>>>
> >>>>
> >>>> without copy_from_user(), how?
> >>>
> >>> For starters, what's the benefit of copying the bitmap from the user
> >>> in the first place?  We presume the data is zero'd and if it's not,
> >>> that's the user's bug to sort out (we just need to define the API to
> >>> specify that).  Therefore the copy_from_user() is unnecessary anyway and
> >>> we really just need to copy_to_user() for any places we're setting
> >>> bits.  We could just walk through the range with an unsigned long
> >>> bitmap buffer, writing it out to userspace any time we reach the end
> >>> with bits set, zeroing it and shifting it as a window to the user
> >>> buffer.  We could improve batching by allocating a larger buffer in the
> >>> kernel, with a kernel defined maximum size and performing the same
> >>> windowing scheme.
> >>>
> >>
> >> Ok, removing copy_from_user().
> >> But AFAIK, calling copy_to_user() multiple times is not efficient in
> >> terms of performance.
> >
> > Right, but even with a modestly sized internal buffer for batching we
> > can cover quite a large address space.  128MB for a 4KB buffer, 32GB
> > with a 1MB buffer.  __put_user() is more lightweight than copy_to_user(),
> > I wonder where the inflection point is in batching the latter versus
> > more iterations of the former.
> >
> >> I checked the code in virt/kvm/kvm_main.c: __kvm_set_memory_region(),
> >> where dirty_bitmap is allocated, which has generic checks: a user space
> >> address check, a memory overflow check and KVM_MEM_MAX_NR_PAGES, as
> >> below. I'll add the access_ok check. I already added the overflow check.
> >>
> >> /* General sanity checks */
> >> if (mem->memory_size & (PAGE_SIZE - 1))
> >>         goto out;
> >>
> >> !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >>         mem->memory_size)))
> >>
> >> if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> >>         goto out;
> >>
> >> if (npages > KVM_MEM_MAX_NR_PAGES)
> >>         goto out;
> >>
> >>
> >> Where KVM_MEM_MAX_NR_PAGES is:
> >>
> >> /*
> >>  * Some of the bitops functions do not support too long bitmaps.
> >>  * This number must be determined not to exceed such limits.
> >>  */
> >> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
> >>
> >> But we can't use the KVM specific KVM_MEM_MAX_NR_PAGES check in the
> >> vfio module. Should we define a similar limit in the vfio module
> >> instead of SIZE_MAX?
> >
> > If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
> > Does it seem reasonable to have a kernel interface that potentially
> > allocates 256MB of kernel space with kmalloc accessible to users?  That
> > still seems like a DoS attack vector to me, especially since the user
> > doesn't need to be able to map that much memory (8TB) to access it.
> >
> > I notice that KVM allocates the bitmap (kvzalloc) relative to the actual
> > size of the memory slot when dirty logging is enabled, maybe that's the
> > right approach rather than walking vpfn lists and maintaining unpinned
> > vpfns for the purposes of tracking.  For example, when dirty logging is
> > enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
> > vpfn list is not empty and walk the vpfn list to set dirty bits in the
> > bitmap.
>
> The bitmap will be allocated per vfio_dma, not per
> VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP request, right?

Per vfio_dma when dirty logging is enabled, i.e. between
VFIO_IOMMU_DIRTY_PAGES_FLAG_START and VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP.
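Something like the below is roughly what I have in mind for the START
path.  Entirely untested, assumes PAGE_SIZE granularity for simplicity,
and note that dma->bitmap and the function name are invented here to
illustrate, they don't exist in the current code:

static int vfio_dma_dirty_log_start(struct vfio_iommu *iommu)
{
        struct rb_node *n, *p;

        /* Caller holds iommu->lock */
        for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
                struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);

                /* Nothing pinned here, leave dma->bitmap NULL for now */
                if (RB_EMPTY_ROOT(&dma->pfn_list))
                        continue;

                /* dma->bitmap: proposed new field, one bit per page */
                dma->bitmap = kvzalloc(BITS_TO_LONGS(dma->size >> PAGE_SHIFT) *
                                       sizeof(unsigned long), GFP_KERNEL);
                if (!dma->bitmap)
                        return -ENOMEM; /* real code would unwind */

                /* Everything currently pinned starts out dirty */
                for (p = rb_first(&dma->pfn_list); p; p = rb_next(p)) {
                        struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
                                                         node);

                        bitmap_set(dma->bitmap,
                                   (vpfn->iova - dma->iova) >> PAGE_SHIFT, 1);
                }
        }

        return 0;
}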
> > When new pages are pinned, allocate a bitmap if not already
> > present and set the dirty bit.  When unpinned, update the vpfn list but
> > leave the dirty bit set.  When the dirty bitmap is read, copy out the
> > current bitmap to the user, memset it to zero, then re-walk the vpfn
> > list to set currently dirty pages.
>
> Why is a re-walk required again? Pinning/unpinning and reporting dirty
> pages are done holding iommu->lock, so there shouldn't be a race
> condition.

In order to "remember" that a page was dirtied, I proposed above that we
don't change the bitmap when a page is unpinned.  We can "forget" that a
page was dirtied if it's no longer pinned and we've told the user about
it.  Therefore we need to purge the not-currently-pinned pages from the
bitmap and rebuild it.
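As a sketch, again untested and with the same invented dma->bitmap field
and function name as above, the read side might look like:

static int vfio_dma_dirty_log_read(struct vfio_dma *dma,
                                   unsigned long __user *ubitmap)
{
        unsigned long npages = dma->size >> PAGE_SHIFT;
        struct rb_node *p;

        /* Caller holds iommu->lock */

        /* Report everything dirtied since the last read... */
        if (copy_to_user(ubitmap, dma->bitmap,
                         BITS_TO_LONGS(npages) * sizeof(unsigned long)))
                return -EFAULT;

        /* ...forget pages that have since been unpinned... */
        bitmap_zero(dma->bitmap, npages);

        /* ...and re-mark the currently pinned pages as dirty */
        for (p = rb_first(&dma->pfn_list); p; p = rb_next(p)) {
                struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, node);

                bitmap_set(dma->bitmap,
                           (vpfn->iova - dma->iova) >> PAGE_SHIFT, 1);
        }

        return 0;
}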
> > A vfio_dma without a dirty bitmap
> > would consider the entire range dirty.
>
> That will depend on the (!iommu->pinned_page_dirty_scope &&
> dma->iommu_mapped) condition to mark the entire range dirty.

I assumed we wouldn't bother to maintain a bitmap unless these
conditions are already met.

> Here even if the vpfn list is empty, memory for the dirty_bitmap needs
> to be allocated, all bits memset to 1, then copy_to_user().

I was assuming we might use a __put_user() loop to fill such ranges
rather than waste memory tracking fully populated bitmaps.

> If we go with this approach, then I think there should be a restriction
> to get the bitmap as per the way the mappings were created; multiple
> mappings can be clubbed together, but a request shouldn't bisect a
> mapping - similar to the unmap restriction.

Why?  It seems like it's just some pointer arithmetic to copy out the
section of the bitmap that the user requests.  Thanks,

Alex