> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Friday, November 1, 2024 9:55 PM
>
> On Thu, 31 Oct 2024 15:04:51 +0000
> Parav Pandit <parav@xxxxxxxxxx> wrote:
>
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Wednesday, October 30, 2024 1:58 AM
> > >
> > > On Mon, 28 Oct 2024 17:46:57 +0000
> > > Parav Pandit <parav@xxxxxxxxxx> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > > Sent: Monday, October 28, 2024 10:24 PM
> > > > >
> > > > > On Mon, 28 Oct 2024 13:23:54 -0300
> > > > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > If the virtio spec doesn't support partial contexts, what
> > > > > > > makes it beneficial here?
> > > > > >
> > > > > > It still lets the receiver 'warm up', like allocating memory
> > > > > > and approximately sizing things.
> > > > > >
> > > > > > > If it is beneficial, why is it beneficial to send initial
> > > > > > > data more than once?
> > > > > >
> > > > > > I guess because it is allowed to change and the benefit is
> > > > > > highest when the pre-copy data closely matches the final data.
> > > > >
> > > > > It would be useful to see actual data here.  For instance, what
> > > > > is the latency advantage to allocating anything in the warm-up
> > > > > and what's the probability that allocation is simply refreshed
> > > > > versus starting over?
> > > >
> > > > Allocating everything during the warm-up phase, compared to no
> > > > allocation, reduced the total VM downtime from 439 ms to 128 ms.
> > > > This was tested using two PCI VF hardware devices per VM.
> > > >
> > > > The benefit comes from the device state staying mostly the same.
> > > >
> > > > We tested with different configurations from 1 to 4 devices per
> > > > VM, varying the vCPUs and memory.  More detailed test results are
> > > > captured in Figure 2 on page 6 at [1].
> > >
> > > Those numbers seem to correspond to column 1 of Figure 2 in the
> > > referenced document, but that's looking only at downtime.
> >
> > Yes.
> > What do you mean by only looking at the downtime?
>
> It's just a prelude to my interpretation that the paper is focusing
> mostly on the benefits to downtime and downplaying the apparent longer
> overall migration time while rationalizing the effect on RAM migration
> downtime.

Now, after the new debug data we shared, we know the other areas too.

> > The intention was to measure the downtime in various configurations.
> > Do you mean we should have looked at migration bandwidth, amount of
> > migrated data, and migration time too?
> > If so, yes, some of them were not considered as the focus was on two
> > things:
> > a. total VM downtime
> > b. total migration time
> >
> > But with recent tests, we looked at more things.  Explained more below.
>
> Good.  Yes, there should be a more holistic approach, improving the
> thing we intend to improve without degrading other aspects.

Yes.

> > > To me that chart seems to show a step function where there's ~400ms
> > > of downtime per device, which suggests we're serializing device
> > > resume in the stop-copy phase on the target without pre-copy.
> >
> > Yes.  Even without serialization, when there is a single device, the
> > same bottleneck can be observed.
> > And your orthogonal suggestion of using parallelism is very useful.
> > The paper captures this aspect in the text on page 7 after Table 2.
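As a side note on the serialization point above, a minimal sketch of what
parallel device state loading could look like is below.  This is not QEMU
code; load_device_state() and the device count are hypothetical
placeholders.  The point is only that the per-device load cost is then
paid roughly once (the slowest device) rather than N times.

/*
 * Hypothetical sketch, not QEMU's implementation: load the stop-copy
 * state of each device from its own thread instead of one device after
 * the other, so total downtime approximates the slowest single device
 * load rather than the sum of all of them.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_DEVICES 4

/* Stand-in for the real per-device state load (~300-400 ms each). */
static void load_device_state(int dev)
{
        printf("loading state for device %d\n", dev);
}

static void *load_worker(void *arg)
{
        load_device_state(*(int *)arg);
        return NULL;
}

int main(void)
{
        pthread_t threads[NUM_DEVICES];
        int ids[NUM_DEVICES];
        int i;

        /* Serialized: downtime ~= NUM_DEVICES * per-device load time.   */
        /* Parallel:   downtime ~= max(per-device load time) + overhead. */
        for (i = 0; i < NUM_DEVICES; i++) {
                ids[i] = i;
                pthread_create(&threads[i], NULL, load_worker, &ids[i]);
        }
        for (i = 0; i < NUM_DEVICES; i++)
                pthread_join(threads[i], NULL);

        return 0;
}
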
> > > Figure 3 appears to look at total VM migration time, where pre-copy
> > > tends to show marginal improvements in smaller configurations, but
> > > up to 60% worse overall migration time as the vCPU, device, and VM
> > > memory size increase.  The paper comes to the conclusion:
> > >
> > >   It can be concluded that either of increasing the VM memory or
> > >   device configuration has equal effect on the VM total migration
> > >   time, but no effect on the VM downtime due to pre-copy
> > >   enablement.
> > >
> > > Noting specifically "downtime" here ignores that the overall
> > > migration time actually got worse with pre-copy.
> > >
> > > Between columns 10 & 11 the device count is doubled.  With pre-copy
> > > enabled, the migration time increases by 135% while with pre-copy
> > > disabled we only see a 113% increase.  Between columns 11 & 12 the
> > > VM memory is further doubled.  This results in another 33% increase
> > > in migration time with pre-copy enabled and only a 3% increase with
> > > pre-copy disabled.  For the most part this entire figure shows that
> > > overall migration time with pre-copy enabled is either on par with
> > > or worse than the same with pre-copy disabled.
> >
> > I will answer this part in more detail towards the end of the email.
> >
> > > We then move on to Tables 1 & 2, which are again back to
> > > specifically showing timing of operations related to downtime
> > > rather than overall migration time.
> >
> > Yes, because the objective was to analyze the effects and
> > improvements on downtime for various configurations of device, VM,
> > and pre-copy.
> >
> > > The notable thing here seems to be that we've amortized the 300ms
> > > per device load time across the pre-copy phase, leaving only 11ms
> > > per device contributing to downtime.
> >
> > Correct.
> >
> > > However, the paper also goes into this tangent:
> > >
> > >   Our observations indicate that enabling device-level pre-copy
> > >   results in more pre-copy operations of the system RAM and
> > >   device state.  This leads to a 50% reduction in memory (RAM)
> > >   copy time in the device pre-copy method in the micro-benchmark
> > >   results, saving 100 milliseconds of downtime.
> > >
> > > I'd argue that this is an anti-feature.  A less generous
> > > interpretation is that pre-copy extended the migration time, likely
> > > resulting in more RAM transfer during pre-copy, potentially to the
> > > point that the VM undershot its prescribed downtime.
> >
> > VM downtime was close to the configured downtime, on the slightly
> > higher side.
> >
> > > Further analysis should also look at the total data transferred for
> > > the migration and adherence to the configured VM downtime, rather
> > > than just the absolute downtime.
> >
> > We did look at the device-side total data transferred to see how many
> > iterations of pre-copy were done.
> >
> > > At the end of the paper, I think we come to the same conclusion
> > > shown in Figure 1, where device load seems to be serialized and
> > > therefore significantly limits scalability.  That could be
> > > parallelized, but even 300-400ms for loading all devices is still
> > > too much contribution to downtime.  I'd therefore agree that
> > > pre-loading the device during pre-copy improves the scaling by an
> > > order of magnitude,
> >
> > Yep.
> >
> > > but it doesn't solve the scaling problem.
> >
> > Yes, your suggestion is very valid.
> > Parallel operation from QEMU would make the downtime even smaller.
> > The paper also highlighted this on page 7 after Table 2.
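For readers following the thread, the device pre-copy being discussed
boils down to draining device state through the VFIO migration data fd
while the VM is still running, so only a small delta is left for
stop-copy.  A rough userspace sketch is below; it is not QEMU's
implementation, send_to_target() is a made-up stand-in for the migration
stream, and data_fd is assumed to already be in the PRE_COPY device
state.

/*
 * Illustrative sketch only.  Drain whatever pre-copy data the device
 * currently reports on its VFIO migration data_fd while the VM keeps
 * running, so that little device state is left for the STOP_COPY phase.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Stand-in for writing into the migration stream (hypothetical). */
static void send_to_target(const void *buf, size_t len)
{
        (void)buf;
        fprintf(stderr, "would send %zu bytes of device state\n", len);
}

static int precopy_drain_once(int data_fd)
{
        struct vfio_precopy_info info = { .argsz = sizeof(info) };
        unsigned long long pending;
        char buf[64 * 1024];

        /* Ask the device how much pre-copy data it currently has queued. */
        if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info))
                return -1;

        pending = info.initial_bytes + info.dirty_bytes;
        while (pending) {
                ssize_t n = read(data_fd, buf, sizeof(buf));

                if (n <= 0)
                        return n ? -1 : 0;
                send_to_target(buf, n);
                pending -= ((unsigned long long)n < pending) ? n : pending;
        }
        return 0;
}

A real implementation would call something like this from the migration
iteration loop, bounded or rate-limited rather than unbounded, with the
final delta transferred in STOP_COPY after the VM is stopped.
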
> > > Also, it should not come with the cost of drawing out pre-copy and
> > > thus the overall migration time to this extent.
> >
> > Right, you pointed this out rightly.
> > So we did several more tests in the last 2 days for the insights you
> > provided, and found an interesting outcome.
> >
> > We collected 30+ samples for each of
> > (a) pre-copy enabled and
> > (b) pre-copy disabled.
> >
> > This was done for columns 10 and 11.
> >
> > The VM total migration time varied in a range of 13 seconds to 60
> > seconds.  Most noticeably, with pre-copy off it also varied in such a
> > large range.
> >
> > In the paper it was pure coincidence that every time pre-copy=on had
> > higher migration time compared to pre-copy=on.  This led us to
>
> Assuming typo here, =on vs =off.

Correct, it is pre-copy=off.

> > wrongly conclude that pre-copy influenced the higher migration time.
> >
> > After some investigation, we found the QEMU anomaly, which was
> > fixed/overcome by the knob "avail-switchover-bandwidth".  Basically
> > the bandwidth calculation was not accurate, due to which the
> > migration time fluctuated a lot.  This problem and solution are
> > described in [2].
> >
> > Following the solution in [2], we ran the exact same tests of columns
> > 10 and 11 with "avail-switchover-bandwidth" configured.  With that,
> > for both modes, pre-copy=on and off, the total migration time stayed
> > constant at 14-15 seconds.
> >
> > And this conclusion aligns with your analysis that "pre-copy should
> > not extend the migration time by this much".  Great finding, proving
> > that Figure 3 was incomplete in the paper.
>
> Great!  So with this the difference in downtime related to RAM
> migration in the trailing tables of the paper becomes negligible?

Yes.

> Is this using the originally proposed algorithm of migrating device
> data up to 128 consecutive times or is it using rate-limiting of
> device data in pre-copy?

Both.  Yishai has a new rate-limiting based algorithm which also has
similar results.

> Any notable differences between those algorithms?

No significant differences.  The VFIO-level data transfer size is less
now, as the frequency is reduced with your suggested algorithm.

> > > The reduction in downtime related to RAM copy time should be
> > > evidence that the pre-copy behavior here has exceeded its scope and
> > > is interfering with the balance between pre- and post-copy
> > > elsewhere.
> >
> > As I explained above, pre-copy did its job, it didn't interfere.  We
> > just didn't have enough of the right samples to analyze back then.
> > Now it is resolved.  Thanks a lot for the direction.
>
> Glad we could arrive at a better understanding overall.  Thanks,
>
> Alex
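P.S. For anyone reproducing this: "avail-switchover-bandwidth" from [2]
is a QEMU migration parameter, so one way to set it is over QMP before
starting the migration.  A sketch only; the 1 GiB/s value below
(1073741824 bytes/second) is an illustrative placeholder, not a
recommendation:

{ "execute": "migrate-set-parameters",
  "arguments": { "avail-switchover-bandwidth": 1073741824 } }

As described in [2], this makes the switchover decision assume the given
bandwidth instead of the fluctuating measured estimate, which is what
kept the total migration time at the constant 14-15 seconds mentioned
above.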