On Wed, Apr 28, 2021 at 4:34 AM zhukeqian <zhukeqian1@xxxxxxxxxx> wrote:
>
> Oh, I have to correct myself.
>
> without this opt:
> first round dirtying: write fault and split large mapping
> second round: write fault
>
> with this opt:
> first round dirtying: no write fault
> second round: write fault and split large mapping
>
> the total test time is expected to be reduced.

Oh yeah, good point. So we should really see the savings in the first
round dirty memory time. Good catch.

>
> On 2021/4/28 0:33, Ben Gardon wrote:
> > On Mon, Apr 26, 2021 at 10:04 PM Keqian Zhu < zhukeqian1@xxxxxxxxxx> wrote:
> >>
> >> Hi Ben,
> >>
> >> Sorry for the delayed reply!
> >>
> >> On 2021/4/21 0:30, Ben Gardon wrote:
> >>> On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu < zhukeqian1@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Ben,
> >>>>
> >>>> On 2021/4/20 3:20, Ben Gardon wrote:
> >>>>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu < zhukeqian1@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Currently during start dirty logging, if we're with init-all-set,
> >>>>>> we write protect huge pages and leave normal pages untouched, for
> >>>>>> that we can enable dirty logging for these pages lazily.
> >>>>>>
> >>>>>> Actually enable dirty logging lazily for huge pages is feasible
> >>>>>> too, which not only reduces the time of start dirty logging, also
> >>>>>> greatly reduces side-effect on guest when there is high dirty rate.
> >>>>>>
> >>>>>> Signed-off-by: Keqian Zhu < zhukeqian1@xxxxxxxxxx>
> >>>>>> ---
> >>>>>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
> >>>>>>  arch/x86/kvm/x86.c | 37 +++++++++----------------------
> >>>>>>  2 files changed, 54 insertions(+), 31 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >>>>>> index 2ce5bc2ea46d..98fa25172b9a 100644
> >>>>>> --- a/arch/x86/kvm/mmu/mmu.c
> >>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
> >>>>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> >>>>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >>>>>>   * @mask: indicates which pages we should protect
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * Used when we do not need to care about huge page mappings.
> >>>>>>   */
> >>>>>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >>>>>>                                             struct kvm_memory_slot *slot,
> >>>>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >>>>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> >>>>>>   * enable dirty logging for them.
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
> >>>>>> + * have any such mappings.
> >>>>>>   */
> >>>>>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>>>>>                                              struct kvm_memory_slot *slot,
> >>>>>>                                              gfn_t gfn_offset, unsigned long mask)
> >>>>>>  {
> >>>>>> +       gfn_t start, end;
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * Huge pages are NOT write protected when we start dirty log with
> >>>>>> +        * init-all-set, so we must write protect them at here.
> >>>>>> +        *
> >>>>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> >>>>>> +        * of memslot has no such restriction, so the range can cross two large
> >>>>>> +        * pages.
> >>>>>> +        */
> >>>>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >>>>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> >>>>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> >>>>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >>>>>> +
> >>>>>> +               /* Cross two large pages? */
> >>>>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> >>>>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> >>>>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> >>>>>> +                                                      PG_LEVEL_2M);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * RFC:
> >>>>>> +        *
> >>>>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> >>>>>> +        * true, because I am not very clear about the relationship between
> >>>>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> >>>>>> +        * manner.
> >>>>>> +        *
> >>>>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> >>>>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> >>>>>> +        * Do we still have normal mapping in that case? (e.g. We have large
> >>>>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> >>>>>
> >>>>> Right, we can't return early because the two MMUs could map the page
> >>>>> in different ways, but each MMU could also map the page in multiple
> >>>>> ways independently.
> >>>>> For example, if the legacy MMU was being used and we were running a
> >>>>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> >>>>> still need kvm_mmu_slot_gfn_write_protect calls for both levels.
> >>>>> I don't think there's a case where we can return early here with the
> >>>>> information that the first calls to kvm_mmu_slot_gfn_write_protect
> >>>>> access.
> >>>> Thanks for the detailed explanation.
> >>>>
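(Just to illustrate the start/end arithmetic in the hunk above for other
readers: below is a rough user-space sketch, not the kernel code itself.
__ffs()/__fls() are approximated with GCC builtins, ALIGN_UP mimics the
kernel's ALIGN(), and the base_gfn value is made up to show a mask that
straddles a 2M boundary.)

#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SIZE	(1UL << 21)			/* 2M */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))	/* like kernel ALIGN() */

int main(void)
{
	unsigned long base_gfn = 0x3ff;	/* memslot base not 2M aligned (made up) */
	unsigned long gfn_offset = 0;	/* always a multiple of 64 */
	unsigned long mask = ~0UL;	/* all 64 pages dirty; assumed non-zero */

	/* first and last set bit of the mask, like __ffs()/__fls() */
	unsigned long start = base_gfn + gfn_offset + __builtin_ctzl(mask);
	unsigned long end = base_gfn + gfn_offset + (63 - __builtin_clzl(mask));

	printf("start gfn %#lx, end gfn %#lx\n", start, end);
	if (ALIGN_UP(start << PAGE_SHIFT, PMD_SIZE) !=
	    ALIGN_UP(end << PAGE_SHIFT, PMD_SIZE))
		printf("mask spans two 2M regions -> two write-protect calls\n");
	else
		printf("single 2M region -> one write-protect call\n");
	return 0;
}

With the values above, gfn 0x3ff and gfn 0x43e land in different 2M
regions, so the second kvm_mmu_slot_gfn_write_protect() call is needed.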
> >>>>>
> >>>>>> + *
> >>>>>> + * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> >>>>>> + * page mapping exist. If it exists but is clean, we can return early.
> >>>>>> + * However, we have to do invasive change.
> >>>>>
> >>>>> What do you mean by invasive change?
> >>>> We need the kvm_mmu_slot_gfn_write_protect to report whether all mappings are large
> >>>> and clean, so we can return early. However, it's not part of the semantics of this function.
> >>>>
> >>>> If this is the final code, compared to the old code, we have an extra gfn_write_protect();
> >>>> I don't know whether it's acceptable?
> >>>
> >>> Ah, I see. Please correct me if I'm wrong, but I think that in order
> >>> to check that the only mappings on the GFN range are large, we'd still
> >>> have to go over the rmap for the 4k mappings, at least for the legacy
> >>> MMU. In that case, we're doing about as much work as the extra
> >>> gfn_write_protect and I don't think that we'd get any efficiency gain
> >>> for the change in semantics.
> >>>
> >>> Likewise for the TDP MMU, if the GFN range is mapped both large and
> >>> 4k, it would have to be in different TDP structures, so the efficiency
> >>> gains would again not be very big.
> >> I am not familiar with the MMU virtualization of the x86 arch, but I think
> >> you are right.
> >>
> >>>
> >>> I'm really just guessing about those performance characteristics
> >>> though. It would definitely help to have some performance data to back
> >>> all this up. Even just a few runs of the dirty_log_perf_test (in
> >>> selftests) could provide some interesting results, and I'd be happy to
> >>> help review any improvements you might make to that test.
> >>>
> >>> Regardless, I'd be inclined to keep this change as simple as possible
> >>> for now and the early return optimization could happen in a follow-up
> >>> patch. I think the extra gfn_write_protect is acceptable, especially
> >>> if you can show that it doesn't cause a big hit in performance when
> >>> running the dirty_log_perf_test with 4k and 2m backing memory.
> >> I tested it using dirty_log_perf_test; the results show that the performance
> >> of clear_dirty_log differs within 2%.
> >
> > I think there are a couple obstacles which make the stock
> > dirty_log_perf_test less useful for measuring this optimization.
> >
> > 1. Variance between runs
> > With only 16 vCPUs and whatever the associated default guest memory
> > size is, random system events and daemons introduce a lot of variance,
> > at least in my testing. I usually try to run the biggest VM I can to
> > smooth that out, but even with a 96 vCPU VM, a 2% difference is often
> > not statistically significant. CPU pinning for the vCPU threads would
> > help a lot to reduce variance. I don't remember if anyone has
> > implemented this yet.
> Yes, this makes sense.
>
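(On the pinning idea above: an untested sketch of what pinning the vCPU
worker threads might look like. The vcpu_threads[] array and nr_vcpus
names are placeholders; the actual thread-creation code in the selftest
differs, so treat this purely as an illustration.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_vcpu_thread(pthread_t thread, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* Bind the vCPU worker thread to a single physical CPU. */
	pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* e.g. after creating the worker threads (hypothetical names):      */
/*	for (i = 0; i < nr_vcpus; i++)                                */
/*		pin_vcpu_thread(vcpu_threads[i].thread, i);           */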
> >
> > 2. The guest dirty pattern
> > By default, each guest vCPU will dirty its entire partition of guest
> > memory on each iteration. This means that instead of amortizing out
> > the cost of write-protecting and splitting large pages, we simply move
> > the burden later in the process. I see you didn't include the time for
> > each iteration below, but I would expect this patch to move some of
> > the time from "Enabling dirty logging time" and "Dirtying memory time"
> > for pass 1 to "Clear dirty log time" and "Dirtying memory time" for
> > pass 2. I wouldn't expect the total time over 5 iterations to change
> > for this test.
> If we have large page mappings and this optimization is applied, the "Enabling dirty logging time"
> and the first-round "Dirtying memory time" will be greatly reduced.
>
> However, I don't think the other times (dirty_memory except the first round, get_log, clear_log)
> are expected to change compared to without the optimization, because after the first round of
> "Dirtying memory", all mappings have been split into normal mappings, so the situation is the
> same as without this optimization.
>
> Maybe I am missing something?
>
> >
> > It would probably also serve us well to have some kind of "hot" subset
> > of memory for each vCPU, since some of the benefit of lazy large page
> > splitting depends on that access pattern.
> >
> > 3. Lockstep dirtying and dirty log collection
> > While this test is currently great for timing dirty logging
> > operations, it's not great for trickier analysis, especially
> > reductions in guest degradation. In order to measure that we'd need to
> > change the test to collect the dirty log as quickly as possible,
> > independent of what the guest is doing, and then also record how much
> > "progress" the guest is able to make while all that is happening.
> Yes, that makes sense.
>
> Does "dirty log collection" contain "dirty log clear"? As I understand it, dirty log
> collection is very fast, just some memory copying. But for "dirty log clear", we have to
> modify mappings and perform TLBI, which takes much longer.

Yeah, sorry. By dirty log collection I meant get + clear since the test
does both before it waits for the guest to dirty all memory again.
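(For readers not familiar with the flow: roughly what "get + clear" means
at the ioctl level is sketched below. Error handling and enabling
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 are omitted, and the slot/num_pages
values are whatever the test uses for its memslot, so this is only an
illustration of the two steps, not the test's actual code.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void collect_dirty_log(int vm_fd, __u32 slot, void *bitmap,
			      __u32 num_pages)
{
	struct kvm_dirty_log get = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.first_page = 0,
		.num_pages = num_pages,
		.dirty_bitmap = bitmap,
	};

	/* "Get": fetch the dirty bitmap; cheap, mostly a memory copy. */
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);

	/*
	 * "Clear": re-protect the pages marked in the bitmap; this is where
	 * mappings are modified and TLBs invalidated, so it dominates the cost.
	 */
	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}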
>
> >
> > I'd be happy to help review any improvements to the test which you
> > feel like making.
> Thanks, Ben. emm... I feel very sorry, but perhaps I don't have enough time to do this; a lot of
> work is queued...
> On the other hand, I think the "Dirtying memory time" of the first round can show us the effect
> of the optimization.

No worries, I think this is a good patch either way. No need to block on
test improvements, from my perspective.

> >
> >>
> >> *Without this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.105203579s
> >> Enabling dirty logging time: 0.000323444s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
> >> Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.922764235s
> >> Enabling dirty logging time: 0.000316473s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
> >> Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)
> >>
> >>
> >> *With this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244515198s
> >> Enabling dirty logging time: 0.000280207s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
> >> Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244294061s
> >> Enabling dirty logging time: 0.000273590s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
> >> Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)
> >>
> >>
> >> I faced a problem: there is no huge page mapping when testing with
> >> "-s anonymous_hugetlb", with TDP either enabled or disabled.
> >
> > Do you mean that even before dirty logging was enabled, KVM didn't
> > create any large mappings? That's odd. I would assume the backing
> > memory allocation would just fail if there aren't enough hugepages
> > available.
> It's odd indeed. I can see large mappings when I do a normal migration, but I
> don't see large mappings when running this test.
>
> I have verified that the time of "clear dirty log" is not affected; what about sending a
> formal patch?

That sounds good to me.

>
> Thanks,
> Keqian
>