On Mon, 28 Oct 2024 17:46:57 +0000
Parav Pandit <parav@xxxxxxxxxx> wrote:

> > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > Sent: Monday, October 28, 2024 10:24 PM
> >
> > On Mon, 28 Oct 2024 13:23:54 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> > >
> > > > If the virtio spec doesn't support partial contexts, what makes it
> > > > beneficial here?
> > >
> > > It still lets the receiver 'warm up', like allocating memory and
> > > approximately sizing things.
> > >
> > > > If it is beneficial, why is it beneficial to send initial data more than
> > > > once?
> > >
> > > I guess because it is allowed to change and the benefit is highest
> > > when the pre copy data closely matches the final data..
> >
> > It would be useful to see actual data here. For instance, what is the latency
> > advantage to allocating anything in the warm-up and what's the probability
> > that allocation is simply refreshed versus starting over?
> >
>
> Allocating everything during the warm-up phase, compared to no
> allocation, reduced the total VM downtime from 439 ms to 128 ms. This
> was tested using two PCI VF hardware devices per VM.
>
> The benefit comes from the device state staying mostly the same.
>
> We tested with different configurations from 1 to 4 devices per VM,
> varied with vcpus and memory. Also, more detailed test results are
> captured in Figure-2 on page 6 at [1].

Those numbers seem to correspond to column 1 of Figure 2 in the
referenced document, but that's looking only at downtime. To me that
chart seems to show a step function where there's ~400ms of downtime
per device, which suggests we're serializing device resume in the
stop-copy phase on the target without pre-copy.

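To make the step-function reading concrete, here is a rough sketch of
what serialized versus concurrent device resume would look like. The
~400ms per-device figure is only my reading of Figure 2, the function
names are made up for illustration, and the parallel case is a
hypothetical contrast rather than anything that was measured.

```python
# Rough model only: ~400 ms per device is read off Figure 2, and the
# parallel case is a hypothetical contrast, not a measured result.
PER_DEVICE_RESUME_MS = 400


def serialized_resume_ms(num_devices: int) -> int:
    """Devices resumed one after another: the cost adds up per device."""
    return num_devices * PER_DEVICE_RESUME_MS


def parallel_resume_ms(num_devices: int) -> int:
    """Devices resumed concurrently: the cost is that of one device."""
    return PER_DEVICE_RESUME_MS if num_devices > 0 else 0


for n in (1, 2, 4):
    print(f"{n} device(s): serialized ~{serialized_resume_ms(n)} ms, "
          f"parallel ~{parallel_resume_ms(n)} ms")
```

A per-device step in the chart is what the serialized model predicts; a
flat line across device counts would point to concurrent resume
instead.
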
Figure 3 appears to look at total VM migration time, where pre-copy
tends to show marginal improvements in smaller configurations, but up
to 60% worse overall migration time as the vCPU, device, and VM memory
size increase. The paper comes to the conclusion:

	It can be concluded that either of increasing the VM memory or
	device configuration has equal effect on the VM total migration
	time, but no effect on the VM downtime due to pre-copy
	enablement.

Noting specifically "downtime" here ignores that the overall migration
time actually got worse with pre-copy.

Between columns 10 & 11 the device count is doubled. With pre-copy
enabled, the migration time increases by 135% while with pre-copy
disabled we only see a 113% increase. Between columns 11 & 12 the VM
memory is further doubled. This results in another 33% increase in
migration time with pre-copy enabled and only a 3% increase with
pre-copy disabled. For the most part this entire figure shows that
overall migration time with pre-copy enabled is either on par with or
worse than the same with pre-copy disabled.

We then move on to Tables 1 & 2, which are again back to specifically
showing timing of operations related to downtime rather than overall
migration time. The notable thing here seems to be that we've
amortized the 300ms per device load time across the pre-copy phase,
leaving only 11ms per device contributing to downtime.

However, the paper also goes into this tangent:

	Our observations indicate that enabling device-level pre-copy
	results in more pre-copy operations of the system RAM and
	device state. This leads to a 50% reduction in memory (RAM)
	copy time in the device pre-copy method in the micro-benchmark
	results, saving 100 milliseconds of downtime.

I'd argue that this is an anti-feature. A less generous interpretation
is that pre-copy extended the migration time, likely resulting in more
RAM transfer during pre-copy, potentially to the point that the VM
undershot its prescribed downtime.

Further analysis should also look at the total data transferred for the
migration and adherence to the configured VM downtime, rather than just
the absolute downtime.

At the end of the paper, I think we come to the same conclusion shown
in Figure 1, where device load seems to be serialized and therefore
significantly limits scalability. That could be parallelized, but even
300-400ms for loading all devices is still too much contribution to
downtime. I'd therefore agree that pre-loading the device during
pre-copy improves the scaling by an order of magnitude, but it doesn't
solve the scaling problem. Also, it should not come with the cost of
drawing out pre-copy and thus the overall migration time to this
extent.

The reduction in downtime related to RAM copy time should be evidence
that the pre-copy behavior here has exceeded its scope and is
interfering with the balance between pre- and post-copy elsewhere.
Thanks,

Alex

>
> [1] https://netdevconf.info/0x18/docs/netdev-0x18-paper22-talk-paper.pdf
>