Hi Marc,

Sorry for the delayed reply.

On 2020/7/6 15:54, Marc Zyngier wrote:
> Hi Keqian,
>
> On 2020-07-06 02:28, zhukeqian wrote:
>> Hi Catalin and Marc,
>>
>> On 2020/7/2 21:55, Keqian Zhu wrote:
>>> This patch series adds support for dirty log based on HW DBM.
>>>
>>> It works well under some migration test cases, including VMs with 4K
>>> pages or 2M THP. I checked the SHA256 hash digest of all memory and it
>>> stays the same for the source VM and the destination VM, which means no
>>> dirty pages are missed under hardware DBM.
>>>
>>> Some key points:
>>>
>>> 1. Only hardware updates of dirty status for PTEs are supported. PMDs and
>>> PUDs are not involved for now.
>>>
>>> 2. About *performance*: In the RFC patch, I mentioned that for every 64GB
>>> of memory, KVM consumes about 40ms to scan all PTEs to collect the dirty
>>> log. This patch series solves the problem in two ways: HW/SW dynamic
>>> switch and multi-core offload.
>>>
>>> HW/SW dynamic switch: Give userspace the right to enable/disable hw dirty
>>> log. This adds a new KVM cap named KVM_CAP_ARM_HW_DIRTY_LOG. We can
>>> achieve this by changing the kvm->arch.vtcr value and kicking vCPUs out to
>>> reload this value into VTCR_EL2. Userspace can then enable hw dirty log at
>>> the beginning and disable it when few pages are dirty and the VM is about
>>> to be stopped, so VM downtime is not affected.
>>>
>>> Multi-core offload: Offloading the PT scanning workload to multiple cores
>>> can greatly reduce the scanning time. To guarantee we can complete in
>>> time, I use smp_call_function to realize this policy, which uses IPIs to
>>> dispatch the workload to other CPUs. On a 128-core Kunpeng 920 platform,
>>> it takes only about 5ms to scan the PTs of 256GB of RAM (using mempress,
>>> so almost all PTs are established). And we dispatch the workload
>>> iteratively (each CPU scans the PTs of only 512M of RAM per iteration),
>>> so it won't seriously affect the physical CPUs.
>>
>> What do you think of these two methods to solve the high cost of PT
>> scanning? Maybe you are waiting for a PML-like feature on ARM :-) , but in
>> my tests, DBM is usable after these two methods are applied.
>
> Useable, maybe. But leaving to userspace the decision to switch from one
> mode to another isn't an acceptable outcome. Userspace doesn't need nor
> want to know about this.
>
OK, maybe this is worth discussing. The switch logic can be encapsulated in
QEMU and would not be visible to VM users. Well, I think it may be
acceptable. :) (A rough sketch of the kernel side of the switch is inlined
further below.)

> Another thing is that sending IPIs all over to trigger scanning may
> work well on a system that runs a limited number of guests (or some
> other userspace, actually), but I seriously doubt that it is impact
> free once you start doing this on an otherwise loaded system.
>
Yes, it is not suitable to send IPIs to all other physical CPUs. For now I
just want to show you my idea and prove that it is effective. In a real
cloud product we have a resource isolation mechanism, so we would get a
somewhat worse result (compared to 5ms), but we won't affect other VMs.

> You may have better results by having an alternative mapping of your
> S2 page tables so that they are accessible linearly, which would
> sidestep the PT parsing altogether, probably saving some cycles. But
Yeah, this is a good idea. But to my understanding, to make them linear we
have to reserve enough physical memory at VM start (which may waste much
memory), and the effect of this optimization is *maybe* not obvious.
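To make the HW/SW dynamic switch more concrete, here is a rough sketch of
what the kernel side could look like. It only illustrates the idea from the
cover letter (flip the stage-2 DBM control bit in kvm->arch.vtcr and kick the
vCPUs so VTCR_EL2 is reloaded on the next guest entry); the helper name and
the VTCR_EL2_HD macro are placeholders here, not necessarily what the actual
patches use, and locking/error handling is omitted:

#include <linux/kvm_host.h>

/* Placeholder: VTCR_EL2.HD (bit 22) enables HW dirty state management. */
#define VTCR_EL2_HD     (1UL << 22)

static void kvm_arm_set_hw_dirty_log(struct kvm *kvm, bool enable)
{
        struct kvm_vcpu *vcpu;
        int i;

        if (enable)
                kvm->arch.vtcr |= VTCR_EL2_HD;
        else
                kvm->arch.vtcr &= ~VTCR_EL2_HD;

        /* Force all vCPUs out of the guest so the new VTCR_EL2 value is
         * loaded on the next guest entry. */
        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_vcpu_kick(vcpu);
}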
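And for the multi-core offload, the rough shape I have in mind is below. All
names here (hwdbm_scan_ctx, hwdbm_scan_ipi, kvm_scan_stage2_dirty) are made
up for illustration and are not taken from the series; it just shows how the
scan could be partitioned across CPUs per round (each CPU takes a 512M slice,
assuming dense CPU numbering), using on_each_cpu() so the local CPU also
participates:

#include <linux/kvm_host.h>
#include <linux/smp.h>

struct hwdbm_scan_ctx {
        struct kvm *kvm;
        gfn_t base_gfn;                 /* first gfn of this round */
        unsigned long pages_per_cpu;    /* e.g. 512M / 4K = 131072 pages */
};

/* Hypothetical walker: sync HW-set dirty state of a gfn range into the
 * dirty bitmap. Not a real kernel function. */
void kvm_scan_stage2_dirty(struct kvm *kvm, gfn_t start, unsigned long nr_pages);

static void hwdbm_scan_ipi(void *info)
{
        struct hwdbm_scan_ctx *ctx = info;
        gfn_t start = ctx->base_gfn +
                      (gfn_t)smp_processor_id() * ctx->pages_per_cpu;

        /* Each CPU walks only its own slice of the stage-2 tables. */
        kvm_scan_stage2_dirty(ctx->kvm, start, ctx->pages_per_cpu);
}

static void hwdbm_scan_round(struct hwdbm_scan_ctx *ctx)
{
        /* Run the callback on every online CPU (IPI to the others) and wait. */
        on_each_cpu(hwdbm_scan_ipi, ctx, 1);
}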
> this is still a marginal gain compared to the overall overhead of
> scanning 4kB of memory per 2MB of guest RAM, as opposed to 64 *bytes*
> per 2MB (assuming strict 4kB mappings at S2, no block mappings).
>
I once tested scanning the PTs while reading only one byte of each PTE, and
the result stayed the same. So when we scan the PTs using just one core, the
bottleneck is CPU speed, not memory bandwidth. (A rough sketch of the scan
loop and of the 4kB-per-2MB arithmetic is appended at the end of this mail.)

> Finally, this doesn't work with pages dirtied from DMA, which is the
> biggest problem. If you cannot track pages that are dirtied behind your
> back, what is the purpose of scanning the dirty bits?
>
> As for a PML-like feature, this would only be useful if the SMMU
> architecture took part in it and provided consistent logging of
> the dirtied pages in the IPA space. Only having it at the CPU level
> would be making the exact same mistake.
Even if the SMMU were equipped with a PML-like feature, we would still rely
on device suspend to avoid missing dirty pages, so I think the only advantage
of PML is reducing the dirty log sync time compared to multi-core offload.
Maybe I am missing something?

Thanks,
Keqian
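P.S. The arithmetic behind the 4kB-per-2MB figure, assuming strict 4K
mappings at stage 2: one 2MB range is covered by one last-level table of 512
descriptors of 8 bytes each, i.e. 4kB of page-table memory to read, while the
same range occupies only 512 bits = 64 bytes in the dirty bitmap. The sketch
below shows the kind of scan loop I mean; the macro names and the helper are
illustrative only (DBM is descriptor bit 51 and S2AP[1] is bit 7 in the
stage-2 format), not the actual patch code:

#include <linux/bitops.h>
#include <linux/types.h>

#define S2_PTE_DBM      (1UL << 51)     /* stage-2 descriptor DBM bit */
#define S2_PTE_S2AP_W   (1UL << 7)      /* S2AP[1]: write permission */

/* Scan the 512 descriptors covering one 2MB range (4kB of PT memory) and
 * record dirty pages in a 64-byte (512-bit) bitmap. Illustrative only. */
static void scan_one_2mb_range(const u64 *pte_table, unsigned long *bitmap)
{
        int i;

        for (i = 0; i < 512; i++) {
                u64 pte = pte_table[i];

                /* With DBM set, HW re-enables write permission on the first
                 * write, so DBM + writable means the page has been dirtied. */
                if ((pte & S2_PTE_DBM) && (pte & S2_PTE_S2AP_W))
                        __set_bit(i, bitmap);
        }
}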