On Mon, Aug 19, 2024 at 10:20 AM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
>
> On 2024-08-16 16:29:11, Sean Christopherson wrote:
> > On Mon, Aug 12, 2024, Vipin Sharma wrote:
> > > +	list_for_each_entry(sp, &kvm->arch.possible_nx_huge_pages, possible_nx_huge_page_link) {
> > > +		if (i++ >= max)
> > > +			break;
> > > +		if (is_tdp_mmu_page(sp) == tdp_mmu)
> > > +			return sp;
> > > +	}
> >
> > This is silly and wasteful.  E.g. in the (unlikely) case there's one TDP MMU
> > page amongst hundreds/thousands of shadow MMU pages, this will walk the list
> > until @max, and then move on to the shadow MMU.
> >
> > Why not just use separate lists?
>
> Before this patch, NX huge page recovery calculates "to_zap" and then
> zaps the first "to_zap" pages from the common list. This series is
> trying to maintain that invariant.
>
> If we use two separate lists, then we have to decide how many pages
> should be zapped from the TDP MMU list and the shadow MMU list. A few
> options I can think of:
>
> 1. Zap "to_zap" pages from both the TDP MMU and shadow MMU lists
>    separately. Effectively, this might double the work for the
>    recovery thread.
> 2. Try zapping "to_zap" pages from one list, and if there are not
>    enough pages to zap, zap from the other list. This can cause
>    starvation.
> 3. Zap half of "to_zap" from one list and the other half from the
>    other list. This can lead to situations where the recovery worker
>    thread only does half of its work per run.
>
> Option (1) above seems more reasonable to me.

I vote each should zap 1/nx_huge_pages_recovery_ratio of their respective
list, i.e. calculate "to_zap" separately for each list.
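
E.g. a rough sketch of what I have in mind (completely untested; the
per-list counters, list fields, and the recover_nx_huge_pages_from_list()
helper are all hypothetical, the series would need to add them):

	static unsigned long nx_huge_pages_to_zap(unsigned long nr_pages)
	{
		unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);

		/* Zap 1/ratio of the pages on the list; ratio == 0 means "never zap". */
		return ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
	}

	...

	/*
	 * Hypothetical per-list counts; today there is only the common
	 * kvm->stat.nx_lpage_splits.
	 */
	to_zap_tdp = nx_huge_pages_to_zap(kvm->arch.tdp_mmu_nx_lpages);
	to_zap_shadow = nx_huge_pages_to_zap(kvm->arch.shadow_nx_lpages);

	recover_nx_huge_pages_from_list(kvm, &kvm->arch.tdp_mmu_possible_nx_huge_pages,
					to_zap_tdp);
	recover_nx_huge_pages_from_list(kvm, &kvm->arch.shadow_possible_nx_huge_pages,
					to_zap_shadow);

That way each MMU's recovery work scales with the number of NX huge
pages that MMU actually split, neither list can starve the other, and
the per-list behavior matches the existing single-list behavior when
only one MMU is in use.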