RE: [PATCH vfio 0/7] Enhances the vfio-virtio driver to support live migration

Parav Pandit <parav@xxxxxxxxxx> · Thu, 31 Oct 2024 15:04:51 +0000

> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Wednesday, October 30, 2024 1:58 AM
> 
> On Mon, 28 Oct 2024 17:46:57 +0000
> Parav Pandit <parav@xxxxxxxxxx> wrote:
> 
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Monday, October 28, 2024 10:24 PM
> > >
> > > On Mon, 28 Oct 2024 13:23:54 -0300
> > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > >
> > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> > > >
> > > > > If the virtio spec doesn't support partial contexts, what makes
> > > > > it beneficial here?
> > > >
> > > > > It still lets the receiver 'warm up', like allocating memory and
> > > > approximately sizing things.
> > > >
> > > > > If it is beneficial, why is it beneficial to send initial data
> > > > > more than once?
> > > >
> > > > I guess because it is allowed to change and the benefit is highest
> > > > when the pre copy data closely matches the final data..
> > >
> > > It would be useful to see actual data here.  For instance, what is
> > > the latency advantage to allocating anything in the warm-up and
> > > what's the probability that allocation is simply refreshed versus
> > > starting over?
> > >
> >
> > Allocating everything during the warm-up phase, compared to no
> > allocation, reduced the total VM downtime from 439 ms to 128 ms. This
> > was tested using two PCI VF hardware devices per VM.
> >
> > The benefit comes from the device state staying mostly the same.
> >
> > We tested with different configurations from 1 to 4 devices per VM,
> > varied with vcpus and memory. Also, more detailed test results are
> > captured in Figure-2 on page 6 at [1].
> 
> Those numbers seem to correspond to column 1 of Figure 2 in the
> referenced document, but that's looking only at downtime.  
Yes.
What do you mean by looking only at the downtime?
The intention was to measure the downtime in various configurations.
Do you mean we should also have looked at migration bandwidth, the amount of data migrated, and migration time?
If so, yes, some of those were not considered, because the focus was on two things:
a. total VM downtime
b. total migration time

In more recent tests we looked at more things. This is explained further below.

> To me that chart
> seems to show a step function where there's ~400ms of downtime per
> device, which suggests we're serializing device resume in the stop-copy
> phase on the target without pre-copy.
>
Yes. Even without serialization, the same bottleneck can be observed when there is a single device.
Your orthogonal suggestion of using parallelism is very useful.
The paper covers this aspect in the text on page 7, after Table 2.

> Figure 3 appears to look at total VM migration time, where pre-copy tends to
> show marginal improvements in smaller configurations, but up to 60% worse
> overall migration time as the vCPU, device, and VM memory size increase.
> The paper comes to the conclusion:
> 
> 	It can be concluded that either of increasing the VM memory or
> 	device configuration has equal effect on the VM total migration
> 	time, but no effect on the VM downtime due to pre-copy
> 	enablement.
> 
> Noting specifically "downtime" here ignores that the overall migration time
> actually got worse with pre-copy.
> 
> Between columns 10 & 11 the device count is doubled.  With pre-copy
> enabled, the migration time increases by 135% while with pre-copy disabled
> we only see a 113% increase.  Between columns 11 & 12 the VM
> memory is further doubled.  This results in another 33% increase in
> migration time with pre-copy enabled and only a 3% increase with pre-copy
> disabled.  For the most part this entire figure shows that overall migration
> time with pre-copy enabled is either on par with or worse than the same
> with pre-copy disabled.
>
I will answer this part in more detail towards the end of the email.

> We then move on to Tables 1 & 2, which are again back to specifically
> showing timing of operations related to downtime rather than overall
> migration time. 
Yes, because the objective was to analyze how various device, VM, and pre-copy configurations affect and improve downtime.

> The notable thing here seems to be that we've amortized
> the 300ms per device load time across the pre-copy phase, leaving only 11ms
> per device contributing to downtime.
> 
Correct.
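
To make the amortization concrete, here is a rough sketch of the arithmetic. It assumes device load is fully serialized at stop-copy, as discussed above, and uses the approximate per-device figures from this thread (300 ms without pre-copy, 11 ms with it). The function name is hypothetical and the model ignores RAM copy and everything else that contributes to downtime.

```python
def device_load_downtime_ms(num_devices: int, per_device_ms: int) -> int:
    """Downtime from device load when devices are loaded one after another.

    Assumption: loads are serialized, so the cost grows linearly
    with the number of devices.
    """
    return num_devices * per_device_ms


# Approximate per-device load times quoted in this thread.
LOAD_MS_NO_PRECOPY = 300
LOAD_MS_WITH_PRECOPY = 11

for n in (1, 2, 4):
    off = device_load_downtime_ms(n, LOAD_MS_NO_PRECOPY)
    on = device_load_downtime_ms(n, LOAD_MS_WITH_PRECOPY)
    print(f"{n} device(s): pre-copy off {off} ms, pre-copy on {on} ms")
```

Under this model both modes still scale linearly with the device count, which is why parallel load remains a useful orthogonal improvement.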

> However, the paper also goes into this tangent:
> 
> 	Our observations indicate that enabling device-level pre-copy
> 	results in more pre-copy operations of the system RAM and
> 	device state. This leads to a 50% reduction in memory (RAM)
> 	copy time in the device pre-copy method in the micro-benchmark
> 	results, saving 100 milliseconds of downtime.
> 
> I'd argue that this is an anti-feature.  A less generous interpretation is that
> pre-copy extended the migration time, likely resulting in more RAM transfer
> during pre-copy, potentially to the point that the VM undershot its
> prescribed downtime.  
The VM downtime was close to the configured downtime, slightly on the higher side.

> Further analysis should also look at the total data
> transferred for the migration and adherence to the configured VM
> downtime, rather than just the absolute downtime.
>
We did look at the total data transferred on the device side, to see how many pre-copy iterations were done.

> At the end of the paper, I think we come to the same conclusion shown in
> Figure 1, where device load seems to be serialized and therefore significantly
> limits scalability.  That could be parallelized, but even 300-400ms for loading
> all devices is still too much contribution to downtime.  I'd therefore agree
> that pre-loading the device during pre-copy improves the scaling by an order
> of magnitude, 
Yep.
> but it doesn't solve the scaling problem.  
Yes, your suggestion is valid.
Parallel operation from QEMU would make the downtime even smaller.
The paper also highlights this on page 7, after Table 2.

> Also, it should not
> come with the cost of drawing out pre-copy and thus the overall migration
> time to this extent.  
Right, you pointed this out correctly.
So we ran several more tests over the last 2 days, based on the insights you provided,
and found an interesting outcome.

We collected 30+ samples for each of:
(a) pre-copy enabled and
(b) pre-copy disabled.

This was done for columns 10 and 11.

The total VM migration time varied in the range of 13 seconds to 60 seconds.
Most notably, it varied over this large range even with pre-copy off.

In the paper it was pure coincidence that pre-copy=on had a higher migration time than pre-copy=off every time.
This misled us into concluding that pre-copy caused the higher migration time.

After some research, we found a QEMU anomaly that is fixed/overcome by the knob "avail-switchover-bandwidth".
Basically, the bandwidth calculation was not accurate, which made the migration time fluctuate a lot.
This problem and its solution are described in [2].

Following the solution in [2],
we ran the exact same tests for columns 10 and 11 with "avail-switchover-bandwidth" configured.
With that, the total migration time stayed constant at 14-15 seconds for both modes, pre-copy=on and off.
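
For readers unfamiliar with why the bandwidth estimate matters, here is a simplified sketch of the switchover decision as I understand it from [2]. It is not QEMU's actual code. The assumption is that the source switches over once the remaining data could be sent within the downtime limit at the estimated bandwidth. All numeric values and helper names below are hypothetical. The last helper only builds the QMP "migrate-set-parameters" message that sets the knob (value in bytes per second).

```python
import json


def switchover_threshold_bytes(bandwidth_bytes_per_s: int, downtime_limit_ms: int) -> int:
    """Bytes that can be sent within the downtime limit at the given bandwidth."""
    return bandwidth_bytes_per_s * downtime_limit_ms // 1000


def can_switch_over(pending_bytes: int, bandwidth_bytes_per_s: int, downtime_limit_ms: int) -> bool:
    """Simplified model: switch over once the pending data fits in the downtime budget."""
    return pending_bytes <= switchover_threshold_bytes(bandwidth_bytes_per_s, downtime_limit_ms)


def qmp_set_switchover_bandwidth(bytes_per_s: int) -> str:
    """Build the QMP command that tells QEMU the bandwidth available at switchover."""
    return json.dumps({
        "execute": "migrate-set-parameters",
        "arguments": {"avail-switchover-bandwidth": bytes_per_s},
    })


# Hypothetical numbers: 300 MB still pending, 300 ms downtime limit.
pending = 300_000_000
limit_ms = 300

# If the bandwidth is under-estimated (500 MB/s), the threshold is too low
# and pre-copy keeps iterating, so total migration time grows.
print(can_switch_over(pending, 500_000_000, limit_ms))

# With the real link bandwidth supplied (1.25 GB/s), switchover can happen now.
print(can_switch_over(pending, 1_250_000_000, limit_ms))

print(qmp_set_switchover_bandwidth(1_250_000_000))
```

An inaccurate estimate moves the switchover point regardless of whether device pre-copy is on or off, which matches the large variation we saw in both modes.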

This conclusion aligns with your analysis that "pre-copy should not extend the migration time to this extent".
Great finding, proving that Figure 3 in the paper was incomplete.

> The reduction in downtime related to RAM copy time
> should be evidence that the pre-copy behavior here has exceeded its scope
> and is interfering with the balance between pre- and post- copy elsewhere.
As I explained above, pre-copy did its job; it didn't interfere. We simply did not have enough of the right samples to analyze back then.
This is now resolved. Thanks a lot for the direction.

> Thanks,
> 
> Alex
> 
> >
> > [1]
> > https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf
> >

[2] https://lore.kernel.org/qemu-devel/20231010221922.40638-1-peterx@xxxxxxxxxx/