> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Friday, November 1, 2024 9:55 PM
>
> On Thu, 31 Oct 2024 15:04:51 +0000
> Parav Pandit <parav@xxxxxxxxxx> wrote:
>
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Wednesday, October 30, 2024 1:58 AM
> > >
> > > On Mon, 28 Oct 2024 17:46:57 +0000
> > > Parav Pandit <parav@xxxxxxxxxx> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > > Sent: Monday, October 28, 2024 10:24 PM
> > > > >
> > > > > On Mon, 28 Oct 2024 13:23:54 -0300
> > > > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Oct 28, 2024 at 10:13:48AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > If the virtio spec doesn't support partial contexts, what
> > > > > > > makes it beneficial here?
> > > > > >
> > > > > > It still lets the receiver 'warm up', like allocating memory
> > > > > > and approximately sizing things.
> > > > > >
> > > > > > > If it is beneficial, why is it beneficial to send initial
> > > > > > > data more than once?
> > > > > >
> > > > > > I guess because it is allowed to change and the benefit is
> > > > > > highest when the pre-copy data closely matches the final data.
> > > > >
> > > > > It would be useful to see actual data here.  For instance, what
> > > > > is the latency advantage to allocating anything in the warm-up
> > > > > and what's the probability that allocation is simply refreshed
> > > > > versus starting over?
> > > >
> > > > Allocating everything during the warm-up phase, compared to no
> > > > allocation, reduced the total VM downtime from 439 ms to 128 ms.
> > > > This was tested using two PCI VF hardware devices per VM.
> > > >
> > > > The benefit comes from the device state staying mostly the same.
> > > >
> > > > We tested with different configurations from 1 to 4 devices per
> > > > VM, varying the vCPUs and memory.  More detailed test results are
> > > > captured in Figure 2 on page 6 at [1].
> > >
> > > Those numbers seem to correspond to column 1 of Figure 2 in the
> > > referenced document, but that's looking only at downtime.
> >
> > Yes.
> > What do you mean by only looking at the downtime?
>
> It's just a prelude to my interpretation that the paper is focusing
> mostly on the benefits to downtime and downplaying the apparent longer
> overall migration time while rationalizing the effect on RAM migration
> downtime.

Now, after the new debug data we shared, we know the other areas too.

> > The intention was to measure the downtime in various configurations.
> > Do you mean we should have looked at migration bandwidth, amount of
> > migrated data, and migration time too?
> > If so, yes, some of them were not considered as the focus was on two
> > things:
> > a. total VM downtime
> > b. total migration time
> >
> > But with recent tests, we looked at more things.  Explained more below.
>
> Good.  Yes, there should be a more holistic approach, improving the
> thing we intend to improve without degrading other aspects.

Yes.

> > > To me that chart seems to show a step function where there's ~400ms
> > > of downtime per device, which suggests we're serializing device
> > > resume in the stop-copy phase on the target without pre-copy.
> >
> > Yes.  Even without serialization, when there is a single device, the
> > same bottleneck can be observed.
> > And your orthogonal suggestion of using parallelism is very useful.
> > The paper captures this aspect in the text on page 7 after Table 2.
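As a side note on the serialization point above, a minimal sketch of what
parallel device state loading could look like is below.  This is not QEMU
code; load_device_state() and the device count are hypothetical
placeholders.  The point is only that the per-device load cost is then
paid roughly once (the slowest device) rather than N times.

/*
 * Hypothetical sketch, not QEMU's implementation: load the stop-copy
 * state of each device from its own thread instead of one device after
 * the other, so total downtime approximates the slowest single device
 * load rather than the sum of all of them.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_DEVICES 4

/* Stand-in for the real per-device state load (~300-400 ms each). */
static void load_device_state(int dev)
{
        printf("loading state for device %d\n", dev);
}

static void *load_worker(void *arg)
{
        load_device_state(*(int *)arg);
        return NULL;
}

int main(void)
{
        pthread_t threads[NUM_DEVICES];
        int ids[NUM_DEVICES];
        int i;

        /* Serialized: downtime ~= NUM_DEVICES * per-device load time.   */
        /* Parallel:   downtime ~= max(per-device load time) + overhead. */
        for (i = 0; i < NUM_DEVICES; i++) {
                ids[i] = i;
                pthread_create(&threads[i], NULL, load_worker, &ids[i]);
        }
        for (i = 0; i < NUM_DEVICES; i++)
                pthread_join(threads[i], NULL);

        return 0;
}
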
> > > Figure 3 appears to look at total VM migration time, where pre-copy
> > > tends to show marginal improvements in smaller configurations, but
> > > up to 60% worse overall migration time as the vCPU, device, and VM
> > > memory size increase.  The paper comes to the conclusion:
> > >
> > >   It can be concluded that either of increasing the VM memory or
> > >   device configuration has equal effect on the VM total migration
> > >   time, but no effect on the VM downtime due to pre-copy
> > >   enablement.
> > >
> > > Noting specifically "downtime" here ignores that the overall
> > > migration time actually got worse with pre-copy.
> > >
> > > Between columns 10 & 11 the device count is doubled.  With pre-copy
> > > enabled, the migration time increases by 135% while with pre-copy
> > > disabled we only see a 113% increase.  Between columns 11 & 12 the
> > > VM memory is further doubled.  This results in another 33% increase
> > > in migration time with pre-copy enabled and only a 3% increase with
> > > pre-copy disabled.  For the most part this entire figure shows that
> > > overall migration time with pre-copy enabled is either on par with
> > > or worse than the same with pre-copy disabled.
> >
> > I will answer this part in more detail towards the end of the email.
> >
> > > We then move on to Tables 1 & 2, which are again back to
> > > specifically showing timing of operations related to downtime
> > > rather than overall migration time.
> >
> > Yes, because the objective was to analyze the effects and
> > improvements on downtime for various configurations of device, VM,
> > and pre-copy.
> >
> > > The notable thing here seems to be that we've amortized the 300ms
> > > per device load time across the pre-copy phase, leaving only 11ms
> > > per device contributing to downtime.
> >
> > Correct.
> >
> > > However, the paper also goes into this tangent:
> > >
> > >   Our observations indicate that enabling device-level pre-copy
> > >   results in more pre-copy operations of the system RAM and
> > >   device state.  This leads to a 50% reduction in memory (RAM)
> > >   copy time in the device pre-copy method in the micro-benchmark
> > >   results, saving 100 milliseconds of downtime.
> > >
> > > I'd argue that this is an anti-feature.  A less generous
> > > interpretation is that pre-copy extended the migration time, likely
> > > resulting in more RAM transfer during pre-copy, potentially to the
> > > point that the VM undershot its prescribed downtime.
> >
> > VM downtime was close to the configured downtime, on the slightly
> > higher side.
> >
> > > Further analysis should also look at the total data transferred for
> > > the migration and adherence to the configured VM downtime, rather
> > > than just the absolute downtime.
> >
> > We did look at the device-side total data transferred to see how many
> > iterations of pre-copy were done.
> >
> > > At the end of the paper, I think we come to the same conclusion
> > > shown in Figure 1, where device load seems to be serialized and
> > > therefore significantly limits scalability.  That could be
> > > parallelized, but even 300-400ms for loading all devices is still
> > > too much contribution to downtime.  I'd therefore agree that
> > > pre-loading the device during pre-copy improves the scaling by an
> > > order of magnitude,
> >
> > Yep.
> >
> > > but it doesn't solve the scaling problem.
> >
> > Yes, your suggestion is very valid.
> > Parallel operation from QEMU would make the downtime even smaller.
> > The paper also highlighted this on page 7 after Table 2.
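For readers following the thread, the device pre-copy being discussed
boils down to draining device state through the VFIO migration data fd
while the VM is still running, so only a small delta is left for
stop-copy.  A rough userspace sketch is below; it is not QEMU's
implementation, send_to_target() is a made-up stand-in for the migration
stream, and data_fd is assumed to already be in the PRE_COPY device
state.

/*
 * Illustrative sketch only.  Drain whatever pre-copy data the device
 * currently reports on its VFIO migration data_fd while the VM keeps
 * running, so that little device state is left for the STOP_COPY phase.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Stand-in for writing into the migration stream (hypothetical). */
static void send_to_target(const void *buf, size_t len)
{
        (void)buf;
        fprintf(stderr, "would send %zu bytes of device state\n", len);
}

static int precopy_drain_once(int data_fd)
{
        struct vfio_precopy_info info = { .argsz = sizeof(info) };
        unsigned long long pending;
        char buf[64 * 1024];

        /* Ask the device how much pre-copy data it currently has queued. */
        if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info))
                return -1;

        pending = info.initial_bytes + info.dirty_bytes;
        while (pending) {
                ssize_t n = read(data_fd, buf, sizeof(buf));

                if (n <= 0)
                        return n ? -1 : 0;
                send_to_target(buf, n);
                pending -= ((unsigned long long)n < pending) ? n : pending;
        }
        return 0;
}

A real implementation would call something like this from the migration
iteration loop, bounded or rate-limited rather than unbounded, with the
final delta transferred in STOP_COPY after the VM is stopped.
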
> > > Also, it should not come with the cost of drawing out pre-copy and
> > > thus the overall migration time to this extent.
> >
> > Right, you pointed this out rightly.
> > So we did several more tests in the last 2 days for the insights you
> > provided, and found an interesting outcome.
> >
> > We collected 30+ samples for each of
> > (a) pre-copy enabled and
> > (b) pre-copy disabled.
> >
> > This was done for columns 10 and 11.
> >
> > The VM total migration time varied in a range of 13 seconds to 60
> > seconds.  Most noticeably, with pre-copy off it also varied in such a
> > large range.
> >
> > In the paper it was pure coincidence that every time pre-copy=on had
> > higher migration time compared to pre-copy=on.  This led us to
>
> Assuming typo here, =on vs =off.

Correct, it is pre-copy=off.

> > wrongly conclude that pre-copy influenced the higher migration time.
> >
> > After some investigation, we found the QEMU anomaly, which was
> > fixed/overcome by the knob "avail-switchover-bandwidth".  Basically
> > the bandwidth calculation was not accurate, due to which the
> > migration time fluctuated a lot.  This problem and solution are
> > described in [2].
> >
> > Following the solution in [2], we ran the exact same tests of columns
> > 10 and 11 with "avail-switchover-bandwidth" configured.  With that,
> > for both modes, pre-copy=on and off, the total migration time stayed
> > constant at 14-15 seconds.
> >
> > And this conclusion aligns with your analysis that "pre-copy should
> > not extend the migration time by this much".  Great finding, proving
> > that Figure 3 was incomplete in the paper.
>
> Great!  So with this the difference in downtime related to RAM
> migration in the trailing tables of the paper becomes negligible?

Yes.

> Is this using the originally proposed algorithm of migrating device
> data up to 128 consecutive times or is it using rate-limiting of
> device data in pre-copy?

Both.  Yishai has a new rate-limiting based algorithm which also has
similar results.

> Any notable differences between those algorithms?

No significant differences.  The VFIO-level data transfer size is less
now, as the frequency is reduced with your suggested algorithm.

> > > The reduction in downtime related to RAM copy time should be
> > > evidence that the pre-copy behavior here has exceeded its scope and
> > > is interfering with the balance between pre- and post-copy
> > > elsewhere.
> >
> > As I explained above, pre-copy did its job, it didn't interfere.  We
> > just didn't have enough of the right samples to analyze back then.
> > Now it is resolved.  Thanks a lot for the direction.
>
> Glad we could arrive at a better understanding overall.  Thanks,
>
> Alex
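P.S. For anyone reproducing this: "avail-switchover-bandwidth" from [2]
is a QEMU migration parameter, so one way to set it is over QMP before
starting the migration.  A sketch only; the 1 GiB/s value below
(1073741824 bytes/second) is an illustrative placeholder, not a
recommendation:

{ "execute": "migrate-set-parameters",
  "arguments": { "avail-switchover-bandwidth": 1073741824 } }

As described in [2], this makes the switchover decision assume the given
bandwidth instead of the fluctuating measured estimate, which is what
kept the total migration time at the constant 14-15 seconds mentioned
above.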