On Wed, 19 Feb 2020 09:51:32 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 2/19/2020 3:11 AM, Alex Williamson wrote:
> > On Tue, 18 Feb 2020 11:28:53 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >
> >> <snip>
> >>
> >>>>>>> As I understand the above algorithm, we find a vfio_dma
> >>>>>>> overlapping the request and populate the bitmap for that range.
> >>>>>>> Then we go back and put_user() for each byte that we touched. We
> >>>>>>> could instead simply work on a one byte buffer as we enumerate
> >>>>>>> the requested range and do a put_user() every time we reach the
> >>>>>>> end of it and have bits set. That would greatly simplify the
> >>>>>>> above example. But I would expect that we're a) more likely to
> >>>>>>> get asked for ranges covering a single vfio_dma
> >>>>>>
> >>>>>> QEMU asks for a single vfio_dma during each iteration.
> >>>>>>
> >>>>>> If we restrict this ABI to cover a single vfio_dma only, then it
> >>>>>> simplifies the logic here. That was my original suggestion.
> >>>>>> Should we think about that again?
> >>>>>
> >>>>> But we currently allow unmaps that overlap multiple vfio_dmas as
> >>>>> long as no vfio_dma is bisected, so I think that implies that an
> >>>>> unmap while asking for the dirty bitmap has even further
> >>>>> restricted semantics. I'm also reluctant to design an ABI around
> >>>>> what happens to be the current QEMU implementation.
> >>>>>
> >>>>> If we take your example above, ranges {0x0000,0xa000} and
> >>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>>>> following two bitmaps in this implementation:
> >>>>>
> >>>>> 00000011 11111111b
> >>>>> 00111111b
> >>>>>
> >>>>> And we need to combine those into:
> >>>>>
> >>>>> 11111111 11111111b
> >>>>>
> >>>>> Right?
> >>>>>
> >>>>> But it seems like that would be easier if the second bitmap was
> >>>>> instead:
> >>>>>
> >>>>> 11111100b
> >>>>>
> >>>>> Then we wouldn't need to worry about the entire bitmap being
> >>>>> shifted by the bit offset within the byte, which limits our fixes
> >>>>> to the boundary byte and allows us to use copy_to_user() directly
> >>>>> for the bulk of the copy. So how do we get there?
> >>>>>
> >>>>> I think we start with allocating the vfio_dma bitmap to account
> >>>>> for this initial offset, so we calculate bitmap_base_iova as:
> >>>>>
> >>>>>   (iova & ~((PAGE_SIZE << 3) - 1))
> >>>>>
> >>>>> We then use bitmap_base_iova in calculating which bits to set.
> >>>>>
> >>>>> The user needs to follow the same rules, and maybe this adds some
> >>>>> value to the user providing the bitmap size rather than the
> >>>>> kernel calculating it. For example, if the user wanted the dirty
> >>>>> bitmap for the range {0xa000,0x10000} above, they'd provide at
> >>>>> least a 1 byte bitmap, but we'd return bit #2 set to indicate
> >>>>> 0xa000 is dirty.
> >>>>>
> >>>>> Effectively the user can ask for any iova range, but the buffer
> >>>>> will be filled relative to the zeroth bit of the bitmap following
> >>>>> the above bitmap_base_iova formula (and replacing PAGE_SIZE with
> >>>>> the user requested pgsize). I'm tempted to make this explicit in
> >>>>> the user interface (i.e. only allow bitmaps starting on aligned
> >>>>> pages), but a user is able to map and unmap single pages and we
> >>>>> need to support returning a dirty bitmap with an unmap, so I
> >>>>> don't think we can do that.
> >>>>
> >>>> Sigh, finding adjacent vfio_dmas within the same byte seems
> >>>> simpler than this.
> >>>
> >>> How does KVM do this? My intent was that if all of our bitmaps
> >>> share the same alignment then we can merge the intersection and
> >>> continue to use copy_to_user() on either side. However, if QEMU
> >>> doesn't do the same, it doesn't really help us. Is QEMU stuck with
> >>> an implementation of only retrieving dirty bits per
> >>> MemoryRegionSection exactly because of this issue and therefore we
> >>> can rely on it in our implementation as well? Thanks,
> >>
> >> QEMU syncs the dirty bitmap per MemoryRegionSection. Within a
> >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> >> the dirty bitmap per KVMSlot and marks pages dirty for each
> >> KVMSlot.
> >> On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls
> >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy
> >> the bitmap of that memslot.
> >> vfio_dma is per MemoryRegionSection. We can rely on
> >> MemoryRegionSection in our implementation. But to get the bitmap
> >> during unmap, we have to take care of concatenating bitmaps.
> >
> > So KVM does not worry about bitmap alignment because the interface
> > is based on slots; a dirty bitmap can only be retrieved for a
> > single, entire slot. We need VFIO_IOMMU_UNMAP_DMA to maintain its
> > support for spanning multiple vfio_dmas, but maybe we have some
> > leeway that we don't need to support both multiple vfio_dmas and
> > dirty bitmap at the same time. It seems like it would be a massive
> > simplification if we required an unmap with dirty bitmap to span
> > exactly one vfio_dma, right?
>
> Yes.
>
> > I don't see that we'd break any existing users with that. It's
> > unfortunate that we can't have the flexibility of the existing
> > calling convention, but I think there's good reason for it here.
> > Our separate dirty bitmap log reporting would follow the same
> > semantics. I think this all aligns with how the MemoryListener
> > works in QEMU right now, correct? For example we wouldn't need any
> > extra per MAP_DMA tracking in QEMU like KVM has for its slots.
>
> That's right.
> Should we go ahead with the implementation to get the dirty bitmap
> for one vfio_dma for the GET_DIRTY ioctl, and for unmap with the
> dirty bitmap? Accordingly we can add sanity checks in these ioctls.

Yes, I'm convinced that bitmap alignment is sufficiently difficult and
unnecessary that we should restrict the calling convention of
UNMAP_DMA, when using the dirty bitmap extension, to exactly unmap a
single vfio_dma. Thanks,

Alex
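For illustration, a minimal C sketch of the byte-alignment scheme
discussed in the thread. The function and parameter names
(vfio_copy_dirty_bitmap, user_bitmap, and so on) are hypothetical and
not taken from the actual vfio patches; only the bitmap_base_iova
calculation follows the formula quoted above, with PAGE_SIZE replaced
by the user-requested pgsize:

	static int vfio_copy_dirty_bitmap(unsigned long *bitmap,
					  dma_addr_t iova, size_t size,
					  size_t pgsize,
					  u8 __user *user_bitmap)
	{
		/*
		 * One bitmap byte covers 8 pages, so align the base iova
		 * down to an 8-page span; bit 0 of the buffer then
		 * corresponds to bitmap_base_iova rather than to iova.
		 */
		dma_addr_t bitmap_base_iova = iova & ~((pgsize << 3) - 1);
		unsigned long nbits = (iova + size - bitmap_base_iova) /
				      pgsize;
		size_t nbytes = DIV_ROUND_UP(nbits, BITS_PER_BYTE);

		/*
		 * Because dirty bits were set relative to
		 * bitmap_base_iova, the buffer is byte aligned and the
		 * bulk of the copy needs no per-byte fixups.
		 */
		if (copy_to_user(user_bitmap, bitmap, nbytes))
			return -EFAULT;

		return 0;
	}

With the {0xa000,0x10000} example above and a pgsize of 0x1000,
bitmap_base_iova comes out to 0x8000, so a dirty page at 0xa000 lands
in bit #2 of a 1-byte buffer, matching the example in the thread.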
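Likewise, a sketch of the sanity check the participants converge on:
when the dirty bitmap is requested during unmap, the request must
match exactly one vfio_dma. vfio_find_dma() is the existing lookup
helper in vfio_iommu_type1.c; the flag name below is illustrative,
since the flag had not been settled at this point in the thread:

	/* Inside vfio_dma_do_unmap(), before walking the vfio_dma list */
	if (unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
		struct vfio_dma *dma = vfio_find_dma(iommu, unmap->iova,
						     unmap->size);

		/* The unmap must cover exactly one vfio_dma */
		if (!dma || dma->iova != unmap->iova ||
		    dma->size != unmap->size)
			return -EINVAL;
	}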