On Mon, 25 Nov 2019 19:57:39 -0500 Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote: > > On Fri, 15 Nov 2019 00:26:07 +0530 > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > > > On 11/14/2019 1:37 AM, Alex Williamson wrote: > > > > On Thu, 14 Nov 2019 01:07:21 +0530 > > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > > > > > >> On 11/13/2019 4:00 AM, Alex Williamson wrote: > > > >>> On Tue, 12 Nov 2019 22:33:37 +0530 > > > >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > >>> > > > >>>> All pages pinned by vendor driver through vfio_pin_pages API should be > > > >>>> considered as dirty during migration. IOMMU container maintains a list of > > > >>>> all such pinned pages. Added an ioctl defination to get bitmap of such > > > >>> > > > >>> definition > > > >>> > > > >>>> pinned pages for requested IO virtual address range. > > > >>> > > > >>> Additionally, all mapped pages are considered dirty when physically > > > >>> mapped through to an IOMMU, modulo we discussed devices opting in to > > > >>> per page pinning to indicate finer granularity with a TBD mechanism to > > > >>> figure out if any non-opt-in devices remain. > > > >>> > > > >> > > > >> You mean, in case of device direct assignment (device pass through)? > > > > > > > > Yes, or IOMMU backed mdevs. If vfio_dmas in the container are fully > > > > pinned and mapped, then the correct dirty page set is all mapped pages. > > > > We discussed using the vpfn list as a mechanism for vendor drivers to > > > > reduce their migration footprint, but we also discussed that we would > > > > need a way to determine that all participants in the container have > > > > explicitly pinned their working pages or else we must consider the > > > > entire potential working set as dirty. > > > > > > > > > > How can vendor driver tell this capability to iommu module? Any suggestions? > > > > I think it does so by pinning pages. Is it acceptable that if the > > vendor driver pins any pages, then from that point forward we consider > > the IOMMU group dirty page scope to be limited to pinned pages? There > we should also be aware of that dirty page scope is pinned pages + unpinned pages, > which means ever since a page is pinned, it should be regarded as dirty > no matter whether it's unpinned later. only after log_sync is called and > dirty info retrieved, its dirty state should be cleared. Yes, good point. We can't just remove a vpfn when a page is unpinned or else we'd lose information that the page potentially had been dirtied while it was pinned. Maybe that vpfn needs to move to a dirty list and both the currently pinned vpfns and the dirty vpfns are walked on a log_sync. The dirty vpfns list would be cleared after a log_sync. The container would need to know that dirty tracking is enabled and only manage the dirty vpfns list when necessary. Thanks, Alex > > are complications around non-singleton IOMMU groups, but I think we're > > already leaning towards that being a non-worthwhile problem to solve. > > So if we require that only singleton IOMMU groups can pin pages and we > > pass the IOMMU group as a parameter to > > vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a > > flag on its local vfio_group struct to indicate dirty page scope is > > limited to pinned pages. We might want to keep a flag on the > > vfio_iommu struct to indicate if all of the vfio_groups for each > > vfio_domain in the vfio_iommu.domain_list dirty page scope limited to > > pinned pages as an optimization to avoid walking lists too often. Then > > we could test if vfio_iommu.domain_list is not empty and this new flag > > does not limit the dirty page scope, then everything within each > > vfio_dma is considered dirty. > > > > > >>>> Signed-off-by: Kirti Wankhede <kwankhede@xxxxxxxxxx> > > > >>>> Reviewed-by: Neo Jia <cjia@xxxxxxxxxx> > > > >>>> --- > > > >>>> include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++ > > > >>>> 1 file changed, 23 insertions(+) > > > >>>> > > > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > > >>>> index 35b09427ad9f..6fd3822aa610 100644 > > > >>>> --- a/include/uapi/linux/vfio.h > > > >>>> +++ b/include/uapi/linux/vfio.h > > > >>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap { > > > >>>> #define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15) > > > >>>> #define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16) > > > >>>> > > > >>>> +/** > > > >>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17, > > > >>>> + * struct vfio_iommu_type1_dirty_bitmap) > > > >>>> + * > > > >>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration. > > > >>>> + * Get dirty pages bitmap of given IO virtual addresses range using > > > >>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of > > > >>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get > > > >>>> + * bitmap and should set size of allocated memory in bitmap_size field. > > > >>>> + * One bit is used to represent per page consecutively starting from iova > > > >>>> + * offset. Bit set indicates page at that offset from iova is dirty. > > > >>>> + */ > > > >>>> +struct vfio_iommu_type1_dirty_bitmap { > > > >>>> + __u32 argsz; > > > >>>> + __u32 flags; > > > >>>> + __u64 iova; /* IO virtual address */ > > > >>>> + __u64 size; /* Size of iova range */ > > > >>>> + __u64 bitmap_size; /* in bytes */ > > > >>> > > > >>> This seems redundant. We can calculate the size of the bitmap based on > > > >>> the iova size. > > > >>> > > > >> > > > >> But in kernel space, we need to validate the size of memory allocated by > > > >> user instead of assuming user is always correct, right? > > > > > > > > What does it buy us for the user to tell us the size? They could be > > > > wrong, they could be malicious. The argsz field on the ioctl is mostly > > > > for the handshake that the user is competent, we should get faults from > > > > the copy-user operation if it's incorrect. > > > > > > > > > > It is to mainly fail safe. > > > > > > >>>> + void __user *bitmap; /* one bit per page */ > > > >>> > > > >>> Should we define that as a __u64* to (a) help with the size > > > >>> calculation, and (b) assure that we can use 8-byte ops on it? > > > >>> > > > >>> However, who defines page size? Is it necessarily the processor page > > > >>> size? A physical IOMMU may support page sizes other than the CPU page > > > >>> size. It might be more important to indicate the expected page size > > > >>> than the bitmap size. Thanks, > > > >>> > > > >> > > > >> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for > > > >> mapping are CPU page size, 4K. Do we still need to have such argument? > > > > > > > > That assumption exists for backwards compatibility prior to supporting > > > > the iova_pgsizes field in vfio_iommu_type1_info. AFAIK the current > > > > interface has no page size assumptions and we should not add any. > > > > > > So userspace has iova_pgsizes information, which can be input to this > > > ioctl. Bitmap should be considering smallest page size. Does that makes > > > sense? > > > > I'm not sure. I thought I had an argument that the iova_pgsize could > > indicate support for sizes smaller than the processor page size, which > > would make the user responsible for using a different base for their > > page size, but vfio_pgsize_bitmap() already masks out sub-page sizes. > > Clearly the vendor driver is pinning based on processor sized pages, > > but that's independent of an IOMMU and not part of a user ABI. > > > > I'm tempted to say your bitmap_size field has a use here, but it seems > > to fail in validating the user page size at the low extremes. For > > example if we have a single page mapping, the user can specify the iova > > size as 4K (for example), but the minimum bitmap_size they can indicate > > is 1 byte, would we therefore assume the user's bitmap page size is 512 > > bytes (ie. they provided us with 8 bits to describe a 4K range)? We'd > > need to be careful to specify that the minimum iova_pgsize indicated > > page size is our lower bound as well. But then what do we do if the > > user provides us with a smaller buffer than we expect? For example, a > > 128MB iova range and only an 8-byte buffer. Do we go ahead and assume > > a 2MB page size and fill the bitmap accordingly or do we generate an > > error? If the latter, might we support that at some point in time and > > is it sufficient to let the user perform trial and error to test if that > > exists? Thanks, > > > > Alex > > >