On Mon, Aug 19, 2024, David Matlack wrote:
> On Mon, Aug 19, 2024 at 10:20 AM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
> >
> > On 2024-08-16 16:29:11, Sean Christopherson wrote:
> > > On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > > > +	list_for_each_entry(sp, &kvm->arch.possible_nx_huge_pages, possible_nx_huge_page_link) {
> > > > +		if (i++ >= max)
> > > > +			break;
> > > > +		if (is_tdp_mmu_page(sp) == tdp_mmu)
> > > > +			return sp;
> > > > +	}
> > >
> > > This is silly and wasteful.  E.g. in the (unlikely) case there's one TDP MMU
> > > page amongst hundreds/thousands of shadow MMU pages, this will walk the list
> > > until @max, and then move on to the shadow MMU.
> > >
> > > Why not just use separate lists?
> >
> > Before this patch, NX huge page recovery calculates "to_zap" and then zaps the
> > first "to_zap" pages from the common list.  This series is trying to maintain
> > that invariant.

I wouldn't try to maintain any specific behavior in the existing code; AFAIK
it's 100% arbitrary and wasn't written with any meaningful sophistication.
E.g. FIFO is little more than blindly zapping pages and hoping for the best.

> > If we use two separate lists, then we have to decide how many pages should be
> > zapped from the TDP MMU list and from the shadow MMU list.  A few options I
> > can think of:
> >
> > 1. Zap "to_zap" pages from both the TDP MMU and shadow MMU lists separately.
> >    Effectively, this might double the work for the recovery thread.
> > 2. Try zapping "to_zap" pages from one list, and if there are not enough pages
> >    to zap, then zap from the other list.  This can cause starvation.
> > 3. Do half of "to_zap" from one list and the other half from the other list.
> >    This can lead to situations where only half the work is being done by the
> >    recovery worker thread.
> >
> > Option (1) above seems more reasonable to me.
>
> I vote each should zap 1/nx_huge_pages_recovery_ratio of their
> respective list. i.e. Calculate to_zap separately for each list.

Yeah, I don't have a better idea, since this is effectively a quick and dirty
solution to reduce guest jitter.

We can at least add a counter so that the zap is proportional to the number of
pages on each list, e.g. this, and then do the necessary math in the recovery
paths.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 94e7b5a4fafe..3ff17ff4f78b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1484,6 +1484,8 @@ struct kvm_arch {
 	 * the code to do so.
 	 */
 	spinlock_t tdp_mmu_pages_lock;
+
+	u64 tdp_mmu_nx_page_splits;
 #endif /* CONFIG_X86_64 */
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 928cf84778b0..b80fe5d4e741 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -870,6 +870,11 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (!list_empty(&sp->possible_nx_huge_page_link))
 		return;
 
+#ifdef CONFIG_X86_64
+	if (is_tdp_mmu_page(sp))
+		++kvm->arch.tdp_mmu_nx_page_splits;
+#endif
+
 	++kvm->stat.nx_lpage_splits;
 	list_add_tail(&sp->possible_nx_huge_page_link,
 		      &kvm->arch.possible_nx_huge_pages);
@@ -905,6 +910,10 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (list_empty(&sp->possible_nx_huge_page_link))
 		return;
 
+#ifdef CONFIG_X86_64
+	if (is_tdp_mmu_page(sp))
+		--kvm->arch.tdp_mmu_nx_page_splits;
+#endif
 	--kvm->stat.nx_lpage_splits;
 	list_del_init(&sp->possible_nx_huge_page_link);
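
For illustration only, the "necessary math" in the recovery path could look
something like the sketch below.  This helper is not part of the diff above
and its name is made up; it assumes the existing nx_huge_pages_recovery_ratio
module param and the kvm->stat.nx_lpage_splits stat, treats everything not
counted by the new tdp_mmu_nx_page_splits field as shadow MMU pages, and
ignores races with concurrent (un)tracking.

/*
 * Sketch: compute a separate zap budget for the TDP MMU and shadow MMU
 * lists, each proportional to that list's share of nx_lpage_splits.
 */
static void kvm_nx_huge_page_zap_budgets(struct kvm *kvm,
					 unsigned long *tdp_to_zap,
					 unsigned long *shadow_to_zap)
{
	unsigned long ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
	unsigned long total = kvm->stat.nx_lpage_splits;
	unsigned long tdp = 0;

#ifdef CONFIG_X86_64
	tdp = kvm->arch.tdp_mmu_nx_page_splits;
#endif

	*tdp_to_zap = 0;
	*shadow_to_zap = 0;
	if (!ratio)
		return;

	/* Each list gets 1/ratio of its own entries, rounded up. */
	*tdp_to_zap = DIV_ROUND_UP(tdp, ratio);
	*shadow_to_zap = DIV_ROUND_UP(total - tdp, ratio);
}

DIV_ROUND_UP keeps the current behavior of zapping at least one page from a
non-empty list even when the list is shorter than the ratio.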