On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> On 11/7/22 22:21, Sean Christopherson wrote:
> >
> > Hmm, and the memslot heuristic doesn't address the recovery worker holding mmu_lock
> > for write. On a non-preemptible kernel, rwlock_needbreak() is always false, e.g.
> > the worker won't yield to vCPUs that are trying to handle non-fast page faults.
> > The worker should eventually reach steady state by unaccounting everything, but
> > that might take a while.
>
> I'm not sure what you mean here? The recovery worker will still decrease
> to_zap by 1 on every unaccounted NX hugepage, and go to sleep after it
> reaches 0.

Right, what I'm saying is that this approach is still sub-optimal because it does
all that work while holding mmu_lock for write.

> Also, David's test used a 10-second halving time for the recovery thread.
> With the 1 hour time the effect would Perhaps the 1 hour time used by
> default by KVM is overly conservative, but 1% over 10 seconds is certainly a
> lot larger an effect, than 1% over 1 hour.

It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
operations on other tasks/vCPUs. Given that this is related to dirty logging,
odds are very good that there will be a variety of operations in flight, e.g.
KVM_GET_DIRTY_LOG. If the recovery ratio is aggressive, and/or there are a lot of
pages to recover, the recovery thread could hold mmu_lock until a resched is
needed.