RE: [PATCH vfio 0/7] Enhances the vfio-virtio driver to support live migration

Parav Pandit <parav@xxxxxxxxxx> · Mon, 28 Oct 2024 17:46:57 +0000

> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Monday, October 28, 2024 10:24 PM
> 
> On Mon, 28 Oct 2024 13:23:54 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> >
> > > If the virtio spec doesn't support partial contexts, what makes it
> > > beneficial here?
> >
> > It still lets the receiver 'warm up', like allocating memory and
> > approximately sizing things.
> >
> > > If it is beneficial, why is it beneficial to send initial data more than
> > > once?
> >
> > I guess because it is allowed to change and the benefit is highest
> > when the pre-copy data closely matches the final data.
> 
> It would be useful to see actual data here.  For instance, what is the latency
> advantage to allocating anything in the warm-up and what's the probability
> that allocation is simply refreshed versus starting over?
> 

Allocating everything during the warm-up phase, compared to no allocation, reduced the total VM downtime from 439 ms to 128 ms.
This was tested using two PCI VF hardware devices per VM.
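
To put those two numbers in relative terms, here is a quick worked calculation. It uses only the 439 ms and 128 ms figures quoted above; the variable names are just for illustration.

```python
# Downtime figures quoted above (two PCI VF devices per VM).
downtime_no_warmup_ms = 439   # no allocation during warm-up
downtime_warmup_ms = 128      # everything allocated during warm-up

saved_ms = downtime_no_warmup_ms - downtime_warmup_ms
reduction_pct = 100 * saved_ms / downtime_no_warmup_ms
speedup = downtime_no_warmup_ms / downtime_warmup_ms

print(f"saved {saved_ms} ms, {reduction_pct:.1f}% lower downtime, {speedup:.2f}x")
```

That is a saving of 311 ms, or roughly 71% less downtime for this one configuration.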

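The benefit comes from the device state staying mostly the same between pre-copy and stop-copy, so what was allocated during warm-up can be reused rather than rebuilt.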

We tested configurations with 1 to 4 devices per VM, varying the number of vCPUs and the amount of memory.
More detailed test results are in Figure 2 on page 6 of [1].

The commit log for patch 7 should have included the perf summary table showing the value of that patch.

Yishai,
If you are planning to send a next revision, please add it.

> Re-sending the initial data up to some arbitrary cap sounds more like we're
> making a policy decision in the driver to consume more migration bandwidth
> for some unknown latency trade-off at stop-copy.  I wonder if that advantage
> disappears if the pre-copy data is at all stale relative to the current device
> state.  Thanks,
> 

You're right. If the pre-copy data differs significantly from the current device state, the benefits might be lost.
However, this can also depend on the device's design. A more advanced device could apply a low-pass filter to avoid unnecessary refreshes.

> Alex

[1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf