* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> On Mon, 23 Mar 2020 23:01:18 -0400
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
>
> > On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> > > > On Mon, 23 Mar 2020 23:24:37 +0530
> > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > >
> > > > > On 3/21/2020 12:29 AM, Alex Williamson wrote:
> > > > > > On Sat, 21 Mar 2020 00:12:04 +0530
> > > > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >> On 3/20/2020 11:31 PM, Alex Williamson wrote:
> > > > > >>> On Fri, 20 Mar 2020 23:19:14 +0530
> > > > > >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >>>
> > > > > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:
> > > > > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530
> > > > > >>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >>>>>
> > > > > >>
> > > > > >> <snip>
> > > > > >>
> > > > > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > > > >>>>>> +                                  size_t size, uint64_t pgsize,
> > > > > >>>>>> +                                  u64 __user *bitmap)
> > > > > >>>>>> +{
> > > > > >>>>>> +        struct vfio_dma *dma;
> > > > > >>>>>> +        unsigned long pgshift = __ffs(pgsize);
> > > > > >>>>>> +        unsigned int npages, bitmap_size;
> > > > > >>>>>> +
> > > > > >>>>>> +        dma = vfio_find_dma(iommu, iova, 1);
> > > > > >>>>>> +
> > > > > >>>>>> +        if (!dma)
> > > > > >>>>>> +                return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +        if (dma->iova != iova || dma->size != size)
> > > > > >>>>>> +                return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +        npages = dma->size >> pgshift;
> > > > > >>>>>> +        bitmap_size = DIRTY_BITMAP_BYTES(npages);
> > > > > >>>>>> +
> > > > > >>>>>> +        /* mark all pages dirty if all pages are pinned and mapped. */
> > > > > >>>>>> +        if (dma->iommu_mapped)
> > > > > >>>>>> +                bitmap_set(dma->bitmap, 0, npages);
> > > > > >>>>>> +
> > > > > >>>>>> +        if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
> > > > > >>>>>> +                return -EFAULT;
> > > > > >>>>>
> > > > > >>>>> We still need to reset the bitmap here, clearing and re-adding the
> > > > > >>>>> pages that are still pinned.
> > > > > >>>>>
> > > > > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@xxxxxxx/
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> I thought you agreed on my reply to it
> > > > > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@xxxxxxxxxx/
> > > > > >>>>
> > > > > >>>> > Why re-populate when there will be no change since
> > > > > >>>> > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
> > > > > >>>> > pin request while vfio_iova_dirty_bitmap() is still working, it will
> > > > > >>>> > wait till iommu->lock is released. Bitmap will be populated when page is
> > > > > >>>> > pinned.
> > > > > >>>
> > > > > >>> As coded, dirty bits are only ever set in the bitmap, never cleared.
> > > > > >>> If a page is unpinned between iterations of the user recording the
> > > > > >>> dirty bitmap, it should be marked dirty in the iteration immediately
> > > > > >>> after the unpinning and not marked dirty in the following iteration.
> > > > > >>> That doesn't happen here.  We're reporting cumulative dirty pages since
> > > > > >>> logging was enabled; we need to be reporting dirty pages since the user
> > > > > >>> last retrieved the dirty bitmap.  The bitmap should be cleared and
> > > > > >>> currently pinned pages re-added after copying to the user.  Thanks,
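[Editorial note: for concreteness, a minimal sketch of the reset Alex is
asking for, assuming it runs under iommu->lock right after the successful
copy_to_user() above and reusing the pfn_list rb-tree of currently pinned
pages that vfio_iommu_type1 already keeps per vfio_dma. Illustrative, not
the actual patch:]

static void vfio_dma_repopulate_bitmap(struct vfio_dma *dma,
                                       unsigned long pgshift)
{
        struct rb_node *n;

        /* Forget everything just reported to the user... */
        bitmap_clear(dma->bitmap, 0, dma->size >> pgshift);

        /* ...but pages still pinned stay dirty for the next query. */
        for (n = rb_first(&dma->pfn_list); n; n = rb_next(n)) {
                struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);

                bitmap_set(dma->bitmap,
                           (vpfn->iova - dma->iova) >> pgshift, 1);
        }
}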
> > > > > >>
> > > > > >> Does that mean we have to track every iteration? Do we really need
> > > > > >> that tracking?
> > > > > >>
> > > > > >> Generally the flow is:
> > > > > >> - vendor driver pins x pages
> > > > > >> - Enter pre-copy phase where vCPUs are running - user starts dirty
> > > > > >>   pages tracking, then user asks for the dirty bitmap; x pages are
> > > > > >>   reported dirty by the VFIO_IOMMU_DIRTY_PAGES ioctl with the _GET
> > > > > >>   flag
> > > > > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap
> > > > > >>   consists of x+y bits set
> > > > > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
> > > > > >>   updated, so again bitmap consists of x+y bits set.
> > > > > >> - Enter stop-and-copy phase, vCPUs are stopped, mdev devices are
> > > > > >>   stopped
> > > > > >> - user asks for the dirty bitmap - since here vCPUs and mdev devices
> > > > > >>   are stopped, pages should not get dirtied by the guest driver or
> > > > > >>   the physical device. Hence, x+y dirty pages would be reported.
> > > > > >>
> > > > > >> I don't think we need to track every iteration of bitmap reporting.
> > > > > >
> > > > > > Yes, once a bitmap is read, it's reset.  In your example, after
> > > > > > unpinning z pages the user should still see a bitmap with x+y pages,
> > > > > > but once they've read that bitmap, the next bitmap should be x+y-z.
> > > > > > Userspace can make decisions about when to switch from pre-copy to
> > > > > > stop-and-copy based on convergence, ie. the slope of the line recording
> > > > > > dirty pages per iteration.  The implementation here never allows an
> > > > > > inflection point; dirty pages reported through vfio would always either
> > > > > > be flat or climbing.  There might also be a case where an iommu-backed
> > > > > > device starts pinning pages during the course of a migration; how
> > > > > > would the bitmap ever revert from fully populated to only tracking the
> > > > > > pinned pages?  Thanks,
> > > > >
> > > > > At KVM forum we discussed this - if the guest driver pins, say, 1024
> > > > > pages before migration starts, then during the pre-copy phase the
> > > > > device can dirty 0 pages in the best case and 1024 pages in the worst
> > > > > case. In that case, the user will transfer the content of those 1024
> > > > > pages during the pre-copy phase and again in the stop-and-copy phase,
> > > > > i.e. those pages will be copied twice. So we decided to only get the
> > > > > dirty pages bitmap at the stop-and-copy phase. If the user is going to
> > > > > get dirty pages in the stop-and-copy phase only, then that will be a
> > > > > single iteration.
> > > > > There aren't any devices yet that can track system memory dirty pages.
> > > > > So we can go ahead with this patch, and support for dirty pages
> > > > > tracking during the pre-copy phase can be added later when there are
> > > > > consumers of that functionality.
> > > >
> > > > So if I understand this right, you're expecting the dirty bitmap to
> > > > accumulate dirty bits, in perpetuity, so that the user can only
> > > > retrieve them once at the end of migration?  But if that's the case,
> > > > the user could simply choose not to retrieve the bitmap until the end
> > > > of migration; the result would be the same.  What we have here is that
> > > > dirty bits are never cleared, regardless of whether the user has seen
> > > > them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
> > > > I don't recall this specific one 5 months later and maybe we weren't
> > > > considering all aspects.  I see the behavior we have here as incorrect,
> > > > but it also seems relatively trivial to make correct.  I hope the QEMU
> > > > code isn't making us go through all this trouble to report a dirty
> > > > bitmap that gets thrown away because it expects the final one to be
> > > > cumulative since the beginning of dirty logging.  Thanks,
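[Editorial note: one way userspace might consume the read-clears semantics
Alex describes - query once per pre-copy iteration and watch the
per-iteration count (x, then y, then x+y-z, ...) converge rather than a
cumulative, ever-growing total. Struct and flag names follow the uapi
proposed in this series and may differ in later revisions; the helper is
illustrative, not QEMU code:]

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Retrieve pages dirtied in [iova, iova + size) since the last call.
 * data must hold data_bytes bytes, one bit per pgsize-sized page.
 */
static int query_dirty_bitmap(int container, uint64_t iova, uint64_t size,
                              uint64_t pgsize, uint64_t *data,
                              uint64_t data_bytes)
{
        uint8_t buf[sizeof(struct vfio_iommu_type1_dirty_bitmap) +
                    sizeof(struct vfio_iommu_type1_dirty_bitmap_get)]
                __attribute__((aligned(8)));
        struct vfio_iommu_type1_dirty_bitmap *header = (void *)buf;
        struct vfio_iommu_type1_dirty_bitmap_get *range =
                (void *)header->data;

        header->argsz = sizeof(buf);
        header->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range->iova = iova;
        range->size = size;
        range->bitmap.pgsize = pgsize;
        range->bitmap.size = data_bytes;
        range->bitmap.data = (__u64 *)data;

        return ioctl(container, VFIO_IOMMU_DIRTY_PAGES, header);
}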
> > >
> > > I remember the discussion that we couldn't track the system memory
> > > dirtying with current hardware; so the question then is just to track
> > hi Dave,
> > there are already devices that are able to track system memory, in
> > two ways:
> > (1) software method, like the VFs for "Intel(R) Ethernet Controller
> > XL710 Family support".
> > (2) hardware method, through a hardware internal buffer (as in one
> > Intel hardware not yet public, but coming very soon) or through the
> > VT-d 3.0 IOMMU.
> >
> > we already have code, verified with both methods, that tracks system
> > memory at a fine-grained level.
> >
> > > what has been pinned and then ideally put that memory off until the end.
> > > (Which is interesting because I don't think we currently have a way
> > > to delay RAM pages till the end in qemu.)
> >
> > I think the problem here is that we have mixed pinned pages with dirty
> > pages. We are reporting dirty pages; pinned pages are just assumed to
> > be dirty.
> >
> > yes, pinned pages for mdev devices are continuously likely to be dirty
> > until the device is stopped.
> > But for devices that are able to report dirty pages, dirtied pages
> > will be marked again if hardware writes them later.
> >
> > So, is it good to introduce a capability to let vfio/qemu know how to
> > treat the dirty pages?
>
> Dirty pages are dirty, QEMU doesn't need any special flag; instead we
> need to evolve different mechanisms for the vendor driver so that we
> can differentiate pages pinned for read vs pages pinned for write.
> Perhaps interfaces to pin pages without dirtying them, and a separate
> mechanism to dirty a previously pinned page, ie. promote it permanently
> or transiently to a writable page.
>
> > (1) for devices that have no fine-grained dirty page tracking capability
> >     a. pinned pages are regarded as dirty pages. they are not cleared
> >        by dirty page query
> >     b. unpinned pages are regarded as dirty pages. they are cleared by
> >        dirty page query or the UNMAP ioctl.
> > (2) for devices that have fine-grained dirty page tracking capability
> >     a. pinned/unpinned pages are not regarded as dirty pages
>
> We need a pin-read-only interface for this.
>
> >     b. only pages they reported are regarded as dirty pages, and those
> >        are to be cleared by dirty page query and the UNMAP ioctl.
>
> We need a set-dirty or promote-writable interface for this.
>
> > (3) for dirty page marking APIs, like vfio_dma_rw()...
> >     pages marked by them are regarded as dirty and are to be cleared
> >     by dirty page query and the UNMAP ioctl
> >
> > For (1), qemu VFIO only reports the dirty page amount and would not
> > transfer those pages until the last round.
> > For (2) and (3), qemu VFIO should report and transfer them in each
> > round.
>
> IMO, QEMU should not be aware of any of this.  Userspace has an
> interface to retrieve dirtied pages (period).  We should adjust the
> pages that we report as dirtied to be accurate based on the
> capabilities of the vendor driver.  We can evolve those internal APIs
> between the vendor driver and vfio iommu over time without modifying
> this user interface.
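[Editorial note: a sketch of the two internal interfaces Alex suggests.
vfio_pin_pages() and its prot argument exist today; skipping the dirty
marking for IOMMU_READ pins, and the vfio_dma_mark_dirty() promotion
helper, are hypothetical names invented here for illustration:]

#include <linux/iommu.h>
#include <linux/vfio.h>

/* Hypothetical promotion interface: mark an already pinned page dirty,
 * permanently or transiently, once the device actually writes it.
 */
int vfio_dma_mark_dirty(struct device *dev, unsigned long *user_pfn,
                        int npage);

static int example_pin_then_promote(struct device *dev,
                                    unsigned long user_pfn)
{
        unsigned long phys_pfn;
        int ret;

        /* Existing interface: pin for device read only; under the
         * suggested semantics the iommu backend would not mark this
         * page dirty, unlike the unconditional marking done today.
         */
        ret = vfio_pin_pages(dev, &user_pfn, 1, IOMMU_READ, &phys_pfn);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        /* Later, the vendor driver promotes the page to writable/dirty
         * without unpinning and re-pinning it.
         */
        return vfio_dma_mark_dirty(dev, &user_pfn, 1);
}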
I'm not sure; if you have a block of memory that's constantly marked
dirty in (1), we need to avoid constantly retransmitting that memory to
the destination; there's no point in sending it until the end of the
iterations - so it shouldn't even get sent once per iteration. But at
the same time, we can't ignore the fact that those pages are going to
be dirty, because that influences the downtime; so we need to know
we're going to be getting them later, even if we don't initially mark
them as dirty.

> > > [I still worry whether migration will be usable with any
> > > significant amount of system ram that's pinned in this way; the
> > > downtime will very easily get above the threshold that people like]
> > >
> > yes. that's why we have to do multi-round dirty page query and
> > transfer, and clear the dirty bitmaps in each round, for devices that
> > are able to track at a fine grain.
> > and that's why we have to report the amount of dirty pages before the
> > stop-and-copy phase for mdev devices, so that people are able to know
> > the real downtime as much as possible.
>
> Yes, the dirty bitmap should be accurate to report the pages dirtied
> since it was last retrieved, and over time we can add internal
> interfaces to give vendor drivers more granularity in marking pinned
> pages dirty, perhaps even exposing the bitmap to the vendor drivers
> to set pages themselves.  I don't necessarily think it's worthwhile to
> create a new class of dirtied pages to transfer at the end; we're
> fighting a losing battle at that point.  We should be focusing on
> improving the granularity of page dirtying in order to reduce the pages
> transferred at the end of migration.  Thanks,

Dave

> Alex

-- 
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
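[Editorial note: the convergence decision both sides circle around, as a
sketch. With an accurate read-clears bitmap, pages a vendor driver keeps
pinned reappear in every iteration, so they automatically form the floor
on achievable downtime that Dave worries about. The names and the policy
below are invented for illustration:]

#include <stdbool.h>
#include <stdint.h>

/* Leave pre-copy only when everything still dirty - including pages
 * that never drop out of the bitmap because they stay pinned - can be
 * transferred within the downtime budget.
 */
static bool ready_for_stop_and_copy(uint64_t dirty_pages,
                                    uint64_t page_size,
                                    uint64_t bw_bytes_per_ms,
                                    uint64_t max_downtime_ms)
{
        uint64_t remaining_bytes = dirty_pages * page_size;

        return remaining_bytes <= bw_bytes_per_ms * max_downtime_ms;
}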