On Thu, Mar 9, 2023 at 4:56 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Thu, Mar 09, 2023 at 04:28:10PM -0800, Vipin Sharma wrote:
> > On Thu, Mar 9, 2023 at 3:53 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> > > > Create a global counter for the total number of pages available
> > > > in MMU page caches across all VMs. Add mmu_shadow_page_cache
> > > > pages to this counter.
> > >
> > > I think I prefer counting the objects on demand in mmu_shrink_count(),
> > > instead of keeping track of the count. Keeping track of the count adds
> > > complexity to the topup/alloc paths for the sole benefit of the
> > > shrinker. I'd rather contain that complexity to the shrinker code unless
> > > there is a compelling reason to optimize mmu_shrink_count().
> > >
> > > IIRC we discussed this at one point. Was there a reason to take this
> > > approach that I'm just forgetting?
> >
> > To count on demand, we first need to take kvm_lock and then, for each
> > VM, iterate through each vCPU, take its cache lock, and sum the object
> > counts in the caches. When NUMA support is introduced later in this
> > series, we will have to iterate over even more caches. We
> > can't/shouldn't use mutex_trylock() as it will not give the correct
> > picture, and by the time shrink_scan is called the count can be
> > totally different.
>
> Yeah, good point. Hm, do we need to take the cache mutex to calculate the
> count though? mmu_shrink_count() is inherently racy (something could get
> freed or allocated in between count() and scan()). I don't think holding
> the mutex buys us anything over just reading the count without the
> mutex.
>

You are right, neither the mutex nor the percpu_counter solves the
accuracy problem with the shrinker. So, this can be removed.

> e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index df8dcb7e5de7..c80a5c52f0ea 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6739,10 +6739,20 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
>  static unsigned long mmu_shrink_count(struct shrinker *shrink,
>                                        struct shrink_control *sc)
>  {
> -        s64 count = percpu_counter_sum(&kvm_total_unused_cached_pages);
> +        struct kvm *kvm, *next_kvm;
> +        unsigned long count = 0;
>
> -        WARN_ON(count < 0);
> -        return count <= 0 ? SHRINK_EMPTY : count;
> +        mutex_lock(&kvm_lock);
> +        list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
> +                struct kvm_vcpu *vcpu;
> +                unsigned long i;
> +
> +                kvm_for_each_vcpu(i, vcpu, kvm)
> +                        count += READ_ONCE(vcpu->arch.mmu_shadow_page_cache.nobjs);
> +        }
> +        mutex_unlock(&kvm_lock);
> +
> +        return count == 0 ? SHRINK_EMPTY : count;
>
>  }
>
> Then the only concern is an additional acquire of kvm_lock. But it
> should be fairly quick (quicker than mmu_shrink_scan()). If we can
> tolerate the kvm_lock overhead of mmu_shrink_scan(), then we should be
> able to tolerate some here.
> >
> > The scan_count() API comment says not to do any deadlock checks (I
> > don't know what that means), and percpu_counter is very fast when we
> > are adding/subtracting pages, so the overhead of using it to keep a
> > global count is minimal. Since there is not much impact to using
> > percpu_counter compared to the previous approach, we ended our
> > discussion with keeping this per-cpu counter.
>
> Yeah, it's just the code complexity of maintaining
> kvm_total_unused_cached_pages that I'm hoping to avoid. We have to
> create the counter, destroy it, and keep it up to date. Some
> kvm_mmu_memory_caches have to update the counter, and others don't. It
> just adds a lot of bookkeeping code that I'm not convinced is worth it.

Yeah, it will simplify the code a lot. Also, we don't need 100%
accuracy with the shrinker. I will remove this global counter in the
next version.
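For context, below is a minimal sketch of the bookkeeping that keeping
kvm_total_unused_cached_pages up to date implies, i.e. roughly what gets
removed by switching to the on-demand counting in the diff above. It
assumes the arch/x86/kvm/mmu/mmu.c context; the hook points and helper
names (mmu_topup_shadow_page_cache(), mmu_free_shadow_page_cache(),
kvm_mmu_init_unused_page_counter()) are illustrative assumptions, not
the actual patch being discussed:

#include <linux/percpu_counter.h>

/* Global count of unused pages sitting in MMU shadow page caches. */
static struct percpu_counter kvm_total_unused_cached_pages;

/* Created in module init and torn down in module exit. */
static int kvm_mmu_init_unused_page_counter(void)
{
	return percpu_counter_init(&kvm_total_unused_cached_pages, 0,
				   GFP_KERNEL);
}

static void kvm_mmu_destroy_unused_page_counter(void)
{
	percpu_counter_destroy(&kvm_total_unused_cached_pages);
}

/* Illustrative topup path: account only the pages actually added. */
static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
	int orig_nobjs = cache->nobjs;
	int r;

	r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
	if (cache->nobjs != orig_nobjs)
		percpu_counter_add(&kvm_total_unused_cached_pages,
				   cache->nobjs - orig_nobjs);
	return r;
}

/* Illustrative free path: drop the freed pages from the global count. */
static void mmu_free_shadow_page_cache(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;

	if (cache->nobjs)
		percpu_counter_sub(&kvm_total_unused_cached_pages,
				   cache->nobjs);
	kvm_mmu_free_memory_cache(cache);
}

Every cache that feeds the shrinker needs a pair of updates like this,
and the shrinker's scan path needs another when it frees cached pages.
That is the bookkeeping cost being weighed against one extra kvm_lock
acquisition in mmu_shrink_count().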