Re: [PATCH vfio 0/7] Enhances the vfio-virtio driver to support live migration

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 29 Oct 2024 14:28:26 -0600

On Mon, 28 Oct 2024 17:46:57 +0000
Parav Pandit <parav@xxxxxxxxxx> wrote:

> > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > Sent: Monday, October 28, 2024 10:24 PM
> > 
> > On Mon, 28 Oct 2024 13:23:54 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >   
> > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> > >  
> > > > If the virtio spec doesn't support partial contexts, what makes it
> > > > beneficial here?  
> > >
> > > It still lets the receiver 'warm up', like allocating memory and
> > > approximately sizing things.
> > >  
> > > > If it is beneficial, why is it beneficial to send initial data more than
> > > > once?  
> > >
> > > I guess because it is allowed to change and the benefit is highest
> > > when the pre copy data closely matches the final data..  
> > 
> > It would be useful to see actual data here.  For instance, what is the latency
> > advantage to allocating anything in the warm-up and what's the probability
> > that allocation is simply refreshed versus starting over?
> >   
> 
> Allocating everything during the warm-up phase, compared to no
> allocation, reduced the total VM downtime from 439 ms to 128 ms. This
> was tested using two PCI VF hardware devices per VM.
>
> The benefit comes from the device state staying mostly the same.
> 
> We tested with different configurations from 1 to 4 devices per VM,
> varied with vcpus and memory. Also, more detailed test results are
> captured in Figure-2 on page 6 at [1].

Those numbers seem to correspond to column 1 of Figure 2 in the
referenced document, but that's looking only at downtime.  To me that
chart seems to show a step function where there's ~400ms of downtime
per device, which suggests we're serializing device resume in the
stop-copy phase on the target without pre-copy.
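
To make the step-function reading concrete, here is a rough model of
serialized versus parallel device resume.  The ~400ms per-device cost
is only my reading of the chart; base_ms and both function names are
hypothetical, not anything taken from QEMU or the driver.

```python
# Rough model only: per_device_ms is the ~400ms read off the chart,
# base_ms is a hypothetical fixed cost unrelated to device count.

def serialized_downtime_ms(num_devices, per_device_ms=400.0, base_ms=0.0):
    """Downtime if devices resume one after another during stop-copy."""
    return base_ms + num_devices * per_device_ms


def parallel_downtime_ms(num_devices, per_device_ms=400.0, base_ms=0.0):
    """Downtime if all devices resume concurrently (ideal case)."""
    return base_ms + (per_device_ms if num_devices > 0 else 0.0)


for n in (1, 2, 4):
    print(n, serialized_downtime_ms(n), parallel_downtime_ms(n))
```

A linear step per device in the measured data matches the first
function, not the second.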

Figure 3 appears to look at total VM migration time, where pre-copy
tends to show marginal improvements in smaller configurations, but up
to 60% worse overall migration time as the vCPU, device, and VM memory
size increase.  The paper comes to the conclusion:

	It can be concluded that either of increasing the VM memory or
	device configuration has equal effect on the VM total migration
	time, but no effect on the VM downtime due to pre-copy
	enablement.

Noting specifically "downtime" here ignores that the overall migration
time actually got worse with pre-copy.

Between columns 10 & 11 the device count is doubled.  With pre-copy
enabled, the migration time increases by 135% while with pre-copy
disabled we only see a 113% increase.  Between columns 11 & 12 the
VM memory is further doubled.  This results in another 33% increase in
migration time with pre-copy enabled and only a 3% increase with
pre-copy disabled.  For the most part this entire figure shows that
overall migration time with pre-copy enabled is either on par with or
worse than the same with pre-copy disabled.
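
As a sanity check on how those increases compound from column 10 to
column 12, a small sketch.  It assumes "increases by N%" means the
time is multiplied by (1 + N/100); growth_factor is a hypothetical
helper name.

```python
# Assumption: "increases by N%" means new = old * (1 + N/100).

def growth_factor(percent_increases):
    """Compound a sequence of percentage increases into one multiplier."""
    factor = 1.0
    for pct in percent_increases:
        factor *= 1.0 + pct / 100.0
    return factor


# Column 10 -> 11 (device count doubled), then 11 -> 12 (memory doubled).
precopy_enabled = growth_factor([135, 33])
precopy_disabled = growth_factor([113, 3])

print(f"pre-copy enabled:  x{precopy_enabled:.1f}")
print(f"pre-copy disabled: x{precopy_disabled:.1f}")
```

Under that assumption the migration time grows by roughly 3.1x with
pre-copy enabled versus roughly 2.2x with it disabled across the same
two steps.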

We then move on to Tables 1 & 2, which are again back to specifically
showing timing of operations related to downtime rather than overall
migration time.  The notable thing here seems to be that we've
amortized the 300ms per device load time across the pre-copy phase,
leaving only 11ms per device contributing to downtime.
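
A minimal sketch of what that amortization means for the device
contribution to downtime, using the 300ms and 11ms per-device figures
above.  It assumes device load stays serialized, so the cost is linear
in device count; the names are hypothetical.

```python
# Per-device figures as quoted above; linear scaling is an assumption
# that holds only while device load remains serialized.
LOAD_MS_PER_DEVICE = 300.0      # paid in downtime without pre-copy
RESIDUAL_MS_PER_DEVICE = 11.0   # left in downtime with pre-copy


def device_downtime_ms(num_devices, precopy):
    """Device-load contribution to downtime, in milliseconds."""
    per_device = RESIDUAL_MS_PER_DEVICE if precopy else LOAD_MS_PER_DEVICE
    return num_devices * per_device


for n in (1, 2, 4):
    print(n, device_downtime_ms(n, precopy=False),
          device_downtime_ms(n, precopy=True))
```

The per-device cost drops by a factor of about 27, but both cases
still scale linearly with the number of devices.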

However, the paper also goes into this tangent:

	Our observations indicate that enabling device-level pre-copy
	results in more pre-copy operations of the system RAM and
	device state. This leads to a 50% reduction in memory (RAM)
	copy time in the device pre-copy method in the micro-benchmark
	results, saving 100 milliseconds of downtime.

I'd argue that this is an anti-feature.  A less generous interpretation
is that pre-copy extended the migration time, likely resulting in more
RAM transfer during pre-copy, potentially to the point that the VM
undershot its prescribed downtime.  Further analysis should also look
at the total data transferred for the migration and adherence to the
configured VM downtime, rather than just the absolute downtime.

At the end of the paper, I think we come to the same conclusion shown
in Figure 1, where device load seems to be serialized and therefore
significantly limits scalability.  That could be parallelized, but
even 300-400ms for loading all devices is still too large a contribution
to downtime.  I'd therefore agree that pre-loading the device during
pre-copy improves the scaling by an order of magnitude, but it doesn't
solve the scaling problem.  Also, it should not come with the cost of
drawing out pre-copy and thus the overall migration time to this
extent.  The reduction in downtime related to RAM copy time should be
evidence that the pre-copy behavior here has exceeded its scope and is
interfering with the balance between pre- and post- copy elsewhere.
Thanks,

Alex

> 
> [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf
>