* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> On Mon, 23 Mar 2020 23:01:18 -0400
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
>
> > On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> > > > On Mon, 23 Mar 2020 23:24:37 +0530
> > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > >
> > > > > On 3/21/2020 12:29 AM, Alex Williamson wrote:
> > > > > > On Sat, 21 Mar 2020 00:12:04 +0530
> > > > > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >> On 3/20/2020 11:31 PM, Alex Williamson wrote:
> > > > > >>> On Fri, 20 Mar 2020 23:19:14 +0530
> > > > > >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >>>
> > > > > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:
> > > > > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530
> > > > > >>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > > > > >>>>>
> > > > > >>
> > > > > >> <snip>
> > > > > >>
> > > > > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > > > >>>>>> +                                  size_t size, uint64_t pgsize,
> > > > > >>>>>> +                                  u64 __user *bitmap)
> > > > > >>>>>> +{
> > > > > >>>>>> +        struct vfio_dma *dma;
> > > > > >>>>>> +        unsigned long pgshift = __ffs(pgsize);
> > > > > >>>>>> +        unsigned int npages, bitmap_size;
> > > > > >>>>>> +
> > > > > >>>>>> +        dma = vfio_find_dma(iommu, iova, 1);
> > > > > >>>>>> +
> > > > > >>>>>> +        if (!dma)
> > > > > >>>>>> +                return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +        if (dma->iova != iova || dma->size != size)
> > > > > >>>>>> +                return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +        npages = dma->size >> pgshift;
> > > > > >>>>>> +        bitmap_size = DIRTY_BITMAP_BYTES(npages);
> > > > > >>>>>> +
> > > > > >>>>>> +        /* mark all pages dirty if all pages are pinned and mapped. */
> > > > > >>>>>> +        if (dma->iommu_mapped)
> > > > > >>>>>> +                bitmap_set(dma->bitmap, 0, npages);
> > > > > >>>>>> +
> > > > > >>>>>> +        if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
> > > > > >>>>>> +                return -EFAULT;
> > > > > >>>>>
> > > > > >>>>> We still need to reset the bitmap here, clearing and re-adding the
> > > > > >>>>> pages that are still pinned.
> > > > > >>>>>
> > > > > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@xxxxxxx/
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> I thought you agreed on my reply to it
> > > > > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@xxxxxxxxxx/
> > > > > >>>>
> > > > > >>>> > Why re-populate when there will be no change since
> > > > > >>>> > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
> > > > > >>>> > pin request while vfio_iova_dirty_bitmap() is still working, it will
> > > > > >>>> > wait till iommu->lock is released. Bitmap will be populated when page is
> > > > > >>>> > pinned.
> > > > > >>>
> > > > > >>> As coded, dirty bits are only ever set in the bitmap, never cleared.
> > > > > >>> If a page is unpinned between iterations of the user recording the
> > > > > >>> dirty bitmap, it should be marked dirty in the iteration immediately
> > > > > >>> after the unpinning and not marked dirty in the following iteration.
> > > > > >>> That doesn't happen here.  We're reporting cumulative dirty pages since
> > > > > >>> logging was enabled; we need to be reporting dirty pages since the user
> > > > > >>> last retrieved the dirty bitmap.  The bitmap should be cleared and
> > > > > >>> currently pinned pages re-added after copying to the user.  Thanks,
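[Editorial note: for concreteness, a minimal sketch of the reset Alex is
asking for, assuming it runs under iommu->lock right after the successful
copy_to_user() above and reusing the pfn_list rb-tree of currently pinned
pages that vfio_iommu_type1 already keeps per vfio_dma. Illustrative, not
the actual patch:]

static void vfio_dma_repopulate_bitmap(struct vfio_dma *dma,
                                       unsigned long pgshift)
{
        struct rb_node *n;

        /* Forget everything just reported to the user... */
        bitmap_clear(dma->bitmap, 0, dma->size >> pgshift);

        /* ...but pages still pinned stay dirty for the next query. */
        for (n = rb_first(&dma->pfn_list); n; n = rb_next(n)) {
                struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);

                bitmap_set(dma->bitmap,
                           (vpfn->iova - dma->iova) >> pgshift, 1);
        }
}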
> > > > > >>
> > > > > >> Does that mean we have to track every iteration? Do we really need
> > > > > >> that tracking?
> > > > > >>
> > > > > >> Generally the flow is:
> > > > > >> - vendor driver pins x pages
> > > > > >> - Enter pre-copy phase where vCPUs are running - user starts dirty
> > > > > >>   pages tracking, then user asks for the dirty bitmap; x pages are
> > > > > >>   reported dirty by the VFIO_IOMMU_DIRTY_PAGES ioctl with the _GET
> > > > > >>   flag
> > > > > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap
> > > > > >>   consists of x+y bits set
> > > > > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
> > > > > >>   updated, so again bitmap consists of x+y bits set.
> > > > > >> - Enter stop-and-copy phase, vCPUs are stopped, mdev devices are
> > > > > >>   stopped
> > > > > >> - user asks for the dirty bitmap - since here vCPUs and mdev devices
> > > > > >>   are stopped, pages should not get dirtied by the guest driver or
> > > > > >>   the physical device. Hence, x+y dirty pages would be reported.
> > > > > >>
> > > > > >> I don't think we need to track every iteration of bitmap reporting.
> > > > > >
> > > > > > Yes, once a bitmap is read, it's reset.  In your example, after
> > > > > > unpinning z pages the user should still see a bitmap with x+y pages,
> > > > > > but once they've read that bitmap, the next bitmap should be x+y-z.
> > > > > > Userspace can make decisions about when to switch from pre-copy to
> > > > > > stop-and-copy based on convergence, ie. the slope of the line recording
> > > > > > dirty pages per iteration.  The implementation here never allows an
> > > > > > inflection point; dirty pages reported through vfio would always either
> > > > > > be flat or climbing.  There might also be a case where an iommu-backed
> > > > > > device starts pinning pages during the course of a migration; how
> > > > > > would the bitmap ever revert from fully populated to only tracking the
> > > > > > pinned pages?  Thanks,
> > > > >
> > > > > At KVM forum we discussed this - if the guest driver pins, say, 1024
> > > > > pages before migration starts, then during the pre-copy phase the
> > > > > device can dirty 0 pages in the best case and 1024 pages in the worst
> > > > > case. In that case, the user will transfer the content of those 1024
> > > > > pages during the pre-copy phase and again in the stop-and-copy phase,
> > > > > i.e. those pages will be copied twice. So we decided to only get the
> > > > > dirty pages bitmap at the stop-and-copy phase. If the user is going to
> > > > > get dirty pages in the stop-and-copy phase only, then that will be a
> > > > > single iteration.
> > > > > There aren't any devices yet that can track system memory dirty pages.
> > > > > So we can go ahead with this patch, and support for dirty pages
> > > > > tracking during the pre-copy phase can be added later when there are
> > > > > consumers of that functionality.
> > > >
> > > > So if I understand this right, you're expecting the dirty bitmap to
> > > > accumulate dirty bits, in perpetuity, so that the user can only
> > > > retrieve them once at the end of migration?  But if that's the case,
> > > > the user could simply choose not to retrieve the bitmap until the end
> > > > of migration; the result would be the same.  What we have here is that
> > > > dirty bits are never cleared, regardless of whether the user has seen
> > > > them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
> > > > I don't recall this specific one 5 months later and maybe we weren't
> > > > considering all aspects.  I see the behavior we have here as incorrect,
> > > > but it also seems relatively trivial to make correct.  I hope the QEMU
> > > > code isn't making us go through all this trouble to report a dirty
> > > > bitmap that gets thrown away because it expects the final one to be
> > > > cumulative since the beginning of dirty logging.  Thanks,
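[Editorial note: one way userspace might consume the read-clears semantics
Alex describes - query once per pre-copy iteration and watch the
per-iteration count (x, then y, then x+y-z, ...) converge rather than a
cumulative, ever-growing total. Struct and flag names follow the uapi
proposed in this series and may differ in later revisions; the helper is
illustrative, not QEMU code:]

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Retrieve pages dirtied in [iova, iova + size) since the last call.
 * data must hold data_bytes bytes, one bit per pgsize-sized page.
 */
static int query_dirty_bitmap(int container, uint64_t iova, uint64_t size,
                              uint64_t pgsize, uint64_t *data,
                              uint64_t data_bytes)
{
        uint8_t buf[sizeof(struct vfio_iommu_type1_dirty_bitmap) +
                    sizeof(struct vfio_iommu_type1_dirty_bitmap_get)]
                __attribute__((aligned(8)));
        struct vfio_iommu_type1_dirty_bitmap *header = (void *)buf;
        struct vfio_iommu_type1_dirty_bitmap_get *range =
                (void *)header->data;

        header->argsz = sizeof(buf);
        header->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range->iova = iova;
        range->size = size;
        range->bitmap.pgsize = pgsize;
        range->bitmap.size = data_bytes;
        range->bitmap.data = (__u64 *)data;

        return ioctl(container, VFIO_IOMMU_DIRTY_PAGES, header);
}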
> > >
> > > I remember the discussion that we couldn't track the system memory
> > > dirtying with current hardware; so the question then is just to track
> > hi Dave,
> > there are already devices that are able to track system memory, in
> > two ways:
> > (1) software method, like the VFs for "Intel(R) Ethernet Controller
> > XL710 Family support".
> > (2) hardware method, through a hardware internal buffer (as in one
> > Intel hardware not yet public, but coming very soon) or through the
> > VT-d 3.0 IOMMU.
> >
> > we already have code, verified with both methods, that tracks system
> > memory at a fine-grained level.
> >
> > > what has been pinned and then ideally put that memory off until the end.
> > > (Which is interesting because I don't think we currently have a way
> > > to delay RAM pages till the end in qemu.)
> >
> > I think the problem here is that we have mixed pinned pages with dirty
> > pages. We are reporting dirty pages; pinned pages are just assumed to
> > be dirty.
> >
> > yes, pinned pages for mdev devices are continuously likely to be dirty
> > until the device is stopped.
> > But for devices that are able to report dirty pages, dirtied pages
> > will be marked again if hardware writes them later.
> >
> > So, is it good to introduce a capability to let vfio/qemu know how to
> > treat the dirty pages?
>
> Dirty pages are dirty, QEMU doesn't need any special flag; instead we
> need to evolve different mechanisms for the vendor driver so that we
> can differentiate pages pinned for read vs pages pinned for write.
> Perhaps interfaces to pin pages without dirtying them, and a separate
> mechanism to dirty a previously pinned page, ie. promote it permanently
> or transiently to a writable page.
>
> > (1) for devices that have no fine-grained dirty page tracking capability
> >     a. pinned pages are regarded as dirty pages. they are not cleared
> >        by dirty page query
> >     b. unpinned pages are regarded as dirty pages. they are cleared by
> >        dirty page query or the UNMAP ioctl.
> > (2) for devices that have fine-grained dirty page tracking capability
> >     a. pinned/unpinned pages are not regarded as dirty pages
>
> We need a pin-read-only interface for this.
>
> >     b. only pages they reported are regarded as dirty pages, and those
> >        are to be cleared by dirty page query and the UNMAP ioctl.
>
> We need a set-dirty or promote-writable interface for this.
>
> > (3) for dirty page marking APIs, like vfio_dma_rw()...
> >     pages marked by them are regarded as dirty and are to be cleared
> >     by dirty page query and the UNMAP ioctl
> >
> > For (1), qemu VFIO only reports the dirty page amount and would not
> > transfer those pages until the last round.
> > For (2) and (3), qemu VFIO should report and transfer them in each
> > round.
>
> IMO, QEMU should not be aware of any of this.  Userspace has an
> interface to retrieve dirtied pages (period).  We should adjust the
> pages that we report as dirtied to be accurate based on the
> capabilities of the vendor driver.  We can evolve those internal APIs
> between the vendor driver and vfio iommu over time without modifying
> this user interface.
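[Editorial note: a sketch of the two internal interfaces Alex suggests.
vfio_pin_pages() and its prot argument exist today; skipping the dirty
marking for IOMMU_READ pins, and the vfio_dma_mark_dirty() promotion
helper, are hypothetical names invented here for illustration:]

#include <linux/iommu.h>
#include <linux/vfio.h>

/* Hypothetical promotion interface: mark an already pinned page dirty,
 * permanently or transiently, once the device actually writes it.
 */
int vfio_dma_mark_dirty(struct device *dev, unsigned long *user_pfn,
                        int npage);

static int example_pin_then_promote(struct device *dev,
                                    unsigned long user_pfn)
{
        unsigned long phys_pfn;
        int ret;

        /* Existing interface: pin for device read only; under the
         * suggested semantics the iommu backend would not mark this
         * page dirty, unlike the unconditional marking done today.
         */
        ret = vfio_pin_pages(dev, &user_pfn, 1, IOMMU_READ, &phys_pfn);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        /* Later, the vendor driver promotes the page to writable/dirty
         * without unpinning and re-pinning it.
         */
        return vfio_dma_mark_dirty(dev, &user_pfn, 1);
}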
I'm not sure; if you have a block of memory that's constantly marked
dirty in (1), we need to avoid constantly retransmitting that memory to
the destination; there's no point in sending it until the end of the
iterations - so it shouldn't even get sent once per iteration. But at
the same time, we can't ignore the fact that those pages are going to
be dirty, because that influences the downtime; so we need to know
we're going to be getting them later, even if we don't initially mark
them as dirty.

> > > [I still worry whether migration will be usable with any
> > > significant amount of system ram that's pinned in this way; the
> > > downtime will very easily get above the threshold that people like]
> > >
> > yes. that's why we have to do multi-round dirty page query and
> > transfer, and clear the dirty bitmaps in each round, for devices that
> > are able to track at a fine grain.
> > and that's why we have to report the amount of dirty pages before the
> > stop-and-copy phase for mdev devices, so that people are able to know
> > the real downtime as much as possible.
>
> Yes, the dirty bitmap should be accurate to report the pages dirtied
> since it was last retrieved, and over time we can add internal
> interfaces to give vendor drivers more granularity in marking pinned
> pages dirty, perhaps even exposing the bitmap to the vendor drivers
> to set pages themselves.  I don't necessarily think it's worthwhile to
> create a new class of dirtied pages to transfer at the end; we're
> fighting a losing battle at that point.  We should be focusing on
> improving the granularity of page dirtying in order to reduce the pages
> transferred at the end of migration.  Thanks,

Dave

> Alex

-- 
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
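[Editorial note: the convergence decision both sides circle around, as a
sketch. With an accurate read-clears bitmap, pages a vendor driver keeps
pinned reappear in every iteration, so they automatically form the floor
on achievable downtime that Dave worries about. The names and the policy
below are invented for illustration:]

#include <stdbool.h>
#include <stdint.h>

/* Leave pre-copy only when everything still dirty - including pages
 * that never drop out of the bitmap because they stay pinned - can be
 * transferred within the downtime budget.
 */
static bool ready_for_stop_and_copy(uint64_t dirty_pages,
                                    uint64_t page_size,
                                    uint64_t bw_bytes_per_ms,
                                    uint64_t max_downtime_ms)
{
        uint64_t remaining_bytes = dirty_pages * page_size;

        return remaining_bytes <= bw_bytes_per_ms * max_downtime_ms;
}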