* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote: > On Mon, 23 Mar 2020 23:24:37 +0530 > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > On 3/21/2020 12:29 AM, Alex Williamson wrote: > > > On Sat, 21 Mar 2020 00:12:04 +0530 > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > > > > >> On 3/20/2020 11:31 PM, Alex Williamson wrote: > > >>> On Fri, 20 Mar 2020 23:19:14 +0530 > > >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > >>> > > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote: > > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530 > > >>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > > >>>>> > > >> > > >> <snip> > > >> > > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova, > > >>>>>> + size_t size, uint64_t pgsize, > > >>>>>> + u64 __user *bitmap) > > >>>>>> +{ > > >>>>>> + struct vfio_dma *dma; > > >>>>>> + unsigned long pgshift = __ffs(pgsize); > > >>>>>> + unsigned int npages, bitmap_size; > > >>>>>> + > > >>>>>> + dma = vfio_find_dma(iommu, iova, 1); > > >>>>>> + > > >>>>>> + if (!dma) > > >>>>>> + return -EINVAL; > > >>>>>> + > > >>>>>> + if (dma->iova != iova || dma->size != size) > > >>>>>> + return -EINVAL; > > >>>>>> + > > >>>>>> + npages = dma->size >> pgshift; > > >>>>>> + bitmap_size = DIRTY_BITMAP_BYTES(npages); > > >>>>>> + > > >>>>>> + /* mark all pages dirty if all pages are pinned and mapped. */ > > >>>>>> + if (dma->iommu_mapped) > > >>>>>> + bitmap_set(dma->bitmap, 0, npages); > > >>>>>> + > > >>>>>> + if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size)) > > >>>>>> + return -EFAULT; > > >>>>> > > >>>>> We still need to reset the bitmap here, clearing and re-adding the > > >>>>> pages that are still pinned. > > >>>>> > > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@xxxxxxx/ > > >>>>> > > >>>> > > >>>> I thought you agreed on my reply to it > > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@xxxxxxxxxx/ > > >>>> > > >>>> > Why re-populate when there will be no change since > > >>>> > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any > > >>>> > pin request while vfio_iova_dirty_bitmap() is still working, it will > > >>>> > wait till iommu->lock is released. Bitmap will be populated when page is > > >>>> > pinned. > > >>> > > >>> As coded, dirty bits are only ever set in the bitmap, never cleared. > > >>> If a page is unpinned between iterations of the user recording the > > >>> dirty bitmap, it should be marked dirty in the iteration immediately > > >>> after the unpinning and not marked dirty in the following iteration. > > >>> That doesn't happen here. We're reporting cumulative dirty pages since > > >>> logging was enabled, we need to be reporting dirty pages since the user > > >>> last retrieved the dirty bitmap. The bitmap should be cleared and > > >>> currently pinned pages re-added after copying to the user. Thanks, > > >>> > > >> > > >> Does that mean, we have to track every iteration? do we really need that > > >> tracking? > > >> > > >> Generally the flow is: > > >> - vendor driver pin x pages > > >> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages > > >> tracking, then user asks dirty bitmap, x pages reported dirty by > > >> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag > > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap > > >> consists of x+y bits set > > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not > > >> updated, so again bitmap consists of x+y bits set. > > >> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped > > >> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped, > > >> pages should not get dirty by guest driver or the physical device. > > >> Hence, x+y dirty pages would be reported. > > >> > > >> I don't think we need to track every iteration of bitmap reporting. > > > > > > Yes, once a bitmap is read, it's reset. In your example, after > > > unpinning z pages the user should still see a bitmap with x+y pages, > > > but once they've read that bitmap, the next bitmap should be x+y-z. > > > Userspace can make decisions about when to switch from pre-copy to > > > stop-and-copy based on convergence, ie. the slope of the line recording > > > dirty pages per iteration. The implementation here never allows an > > > inflection point, dirty pages reported through vfio would always either > > > be flat or climbing. There might also be a case that an iommu backed > > > device could start pinning pages during the course of a migration, how > > > would the bitmap ever revert from fully populated to only tracking the > > > pinned pages? Thanks, > > > > > > > At KVM forum we discussed this - if guest driver pins say 1024 pages > > before migration starts, during pre-copy phase device can dirty 0 pages > > in best case and 1024 pages in worst case. In that case, user will > > transfer content of 1024 pages during pre-copy phase and in > > stop-and-copy phase also, that will be pages will be copied twice. So we > > decided to only get dirty pages bitmap at stop-and-copy phase. If user > > is going to get dirty pages in stop-and-copy phase only, then that will > > be single iteration. > > There aren't any devices yet that can track sys memory dirty pages. So > > we can go ahead with this patch and support for dirty pages tracking > > during pre-copy phase can be added later when there will be consumers of > > that functionality. > > So if I understand this right, you're expecting the dirty bitmap to > accumulate dirty bits, in perpetuity, so that the user can only > retrieve them once at the end of migration? But if that's the case, > the user could simply choose to not retrieve the bitmap until the end > of migration, the result would be the same. What we have here is that > dirty bits are never cleared, regardless of whether the user has seen > them, which is wrong. Sorry, we had a lot of discussions at KVM forum, > I don't recall this specific one 5 months later and maybe we weren't > considering all aspects. I see the behavior we have here as incorrect, > but it also seems relatively trivial to make correct. I hope the QEMU > code isn't making us go through all this trouble to report a dirty > bitmap that gets thrown away because it expects the final one to be > cumulative since the beginning of dirty logging. Thanks, I remember the discussion that we couldn't track the system memory dirtying with current hardware; so the question then is just to track what has been pinned and then ideally put that memory off until the end. (Which is interesting because I don't think we currently have a way to delay RAM pages till the end in qemu). [I still worry whether migration will be usable with any significant amount of system ram that's pinned in this way; the downside will very easily get above the threshold that people like] Dave > Alex -- Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK