On Thu, Nov 17, 2022 at 9:04 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> > On 11/17/22 17:39, Sean Christopherson wrote:
> > > Right, what I'm saying is that this approach is still sub-optimal because it
> > > does all that work while holding mmu_lock for write.
> > >
> > > > Also, David's test used a 10-second halving time for the recovery thread.
> > > > With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
> > > > time used by default by KVM is overly conservative, but 1% over 10 seconds
> > > > is certainly a much larger effect than 1% over 1 hour.
> > >
> > > It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
> > > operations on other tasks/vCPUs.  Given that this is related to dirty logging,
> > > odds are very good that there will be a variety of operations in flight, e.g.
> > > KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a
> > > lot of pages to recover, the recovery thread could hold mmu_lock until a
> > > resched is needed.
> >
> > If you need that, you need to configure your kernel to be preemptible, at
> > least voluntarily.  That's in general a good idea for KVM, given its
> > rwlock-happiness.
>
> IMO, it's not that simple.  We always "need" better live migration performance,
> but we don't need/want preemption in general.
>
> > And the patch is not making it worse, is it?  Yes, you have to look up the
> > memslot, but the work to do that should be less than what you save by not
> > zapping the page.
>
> Yes, my objection is that we're adding a heuristic to guess at userspace's
> intentions (it's probably a good guess, but still) and the resulting behavior
> isn't optimal.  Giving userspace an explicit knob seems straightforward and
> would address both of those issues, so why not go that route?

In this case KVM knows that zapping dirty-tracked pages is completely
useless, regardless of what userspace is doing, so there's no guessing
(see the sketch at the end of this mail for the kind of check I mean).
A userspace knob would require userspace to guess at KVM's implementation
details. E.g. KVM could theoretically support faulting in read accesses
and execute accesses as write-protected huge pages during dirty logging,
or KVM might support 2MiB+ dirty logging. In both cases a binary
userspace knob might not be the best fit.

I agree that, even with this patch, KVM is still suboptimal because it
is still holding mmu_lock for write while doing all these checks. But
this patch should at least be a step in the right direction for reducing
customer hiccups during live migration.

Also, as for the CPU usage, I did a terrible job of explaining the
impact. It's a 1% increase over the current usage, but the current usage
is extremely low even with my way overly aggressive settings.
Specifically, the CPU usage of the NX recovery worker increased from
0.73 CPU-seconds to 0.74 CPU-seconds over a 2.5 minute runtime, i.e.
roughly 0.5% of a single CPU either way.
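
For reference, here is roughly the kind of check I mean, as a sketch
against mmu.c rather than the literal patch: skip NX huge page recovery
for shadow pages whose memslot currently has dirty logging enabled.  The
helper name below is made up for illustration; gfn_to_memslot() and
kvm_slot_dirty_track_enabled() are existing helpers, and a real version
would also need to pick the memslot set matching sp->role (e.g. for SMM)
rather than unconditionally using address space 0.

/*
 * Illustrative sketch only: is this shadow page's gfn covered by a
 * memslot that currently has dirty logging enabled?  If so, zapping it
 * buys nothing, because the guest will immediately fault the mapping
 * back in at 4KiB granularity for as long as dirty logging is enabled.
 */
static bool sp_is_dirty_tracked(struct kvm *kvm, struct kvm_mmu_page *sp)
{
        struct kvm_memory_slot *slot = gfn_to_memslot(kvm, sp->gfn);

        return slot && kvm_slot_dirty_track_enabled(slot);
}

The recovery worker would then skip (and unaccount) any shadow page for
which this returns true instead of zapping it; those pages naturally
become recoverable as huge pages again once dirty logging is disabled
for the slot.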