Re: [PATCH v2] KVM: arm64: Add KVM_CAP to control WFx trapping

Quentin Perret <qperret@xxxxxxxxxx> · Fri, 22 Mar 2024 14:24:35 +0000

On Tuesday 19 Mar 2024 at 16:43:41 (+0000), Colton Lewis wrote:
> Add a KVM_CAP to control WFx (WFI or WFE) trapping based on scheduler
> runqueue depth. This is so they can be passed through if the runqueue
> is shallow or the CPU has support for direct interrupt injection. They
> may be always trapped by setting this value to 0. Technically this
> means traps will be cleared when the runqueue depth is 0, but that
> implies nothing is running anyway so there is no reason to care. The
> default value is 1 to preserve previous behavior before adding this
> option.

I recently discovered that this was enabled by default, but it's not
obvious to me everyone will want this enabled, so I'm in favour of
figuring out a way to turn it off (in fact we might want to make this
feature opt in as the status quo used to be to always trap).

There are a few potential issues I see with having this enabled:

 - a lone vcpu thread on a CPU will completely screw up the host
   scheduler's load tracking metrics if the vcpu actually spends a
   significant amount of time in WFI (the PELT signal will no longer
   be a good proxy for "how much CPU time does this task need");

 - the scheduler's decisions will massively impact the behaviour of the
   vcpu task itself. Whether or not another task gets co-scheduled with
   a vcpu task will change the perceived behaviour of that vcpu task in
   a way that is entirely unpredictable to the scheduler;

 - while the above problems might be OK for some users, I don't think
   this will always be true, e.g. when running on big.LITTLE systems the
   above sounds nightmare-ish;

 - the guest spending long periods of time in WFI prevents the host from
   being able to enter deeper idle states, which will impact power very
   negatively;

And probably a whole bunch of other things.
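
To illustrate the first point, here is a rough sketch of a PELT-like
geometric average. It is a simplified userspace model, not the kernel
code: the helper name, the 1-in-N duty-cycle pattern and the rounded
decay constant are all made up for illustration. The only thing taken
from PELT is that a contribution is halved after 32 periods.

```c
/* Approximation of 0.5^(1/32), so that a contribution halves after 32 periods. */
#define PELT_LIKE_Y 0.97857206

/*
 * Hypothetical helper: utilisation (0.0 to 1.0) that a PELT-like average
 * would report after 'nperiods' periods for a task the host sees as
 * running during one period out of every 'busy_every'.
 */
static double pelt_like_util(int busy_every, int nperiods)
{
	double util = 0.0;

	for (int i = 0; i < nperiods; i++) {
		int running = (i % busy_every) == 0;

		/* Decay the history, then add this period's contribution. */
		util = util * PELT_LIKE_Y + (running ? 1.0 - PELT_LIKE_Y : 0.0);
	}

	return util;
}
```

Take a guest that only has work to do one period out of ten. If WFI
traps, the host sees the vcpu task block and the model above reports
something like pelt_like_util(10, 1000), i.e. a small fraction. If WFI
is passed through, the vcpu task never leaves the CPU from the host's
point of view, so it looks like pelt_like_util(1, 1000), i.e. fully
busy, even though the guest is mostly idle.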

> Think about this option as a threshold. The instruction will be trapped
> if the runqueue depth is higher than the threshold.
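
IIUC the semantics described above boil down to something like the
sketch below. The function and parameter names are hypothetical and not
taken from the patch; this is only meant to pin down the comparison.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not from the patch: decide whether WFx should
 * trap, given the current runqueue depth and the threshold set through
 * the proposed capability.
 */
static bool wfx_should_trap(unsigned int rq_depth, unsigned int threshold)
{
	/* Trap only when the runqueue is deeper than the threshold. */
	return rq_depth > threshold;
}
```

With a threshold of 0 any non-empty runqueue traps, and with the
default of 1 a lone vcpu task gets WFx passed through while a second
runnable task on the same runqueue brings the traps back.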

So talking about the exact interface, I'm not sure exposing this to
userspace is really appropriate. The current rq depth is next to
impossible for userspace to control well.

My gut feeling tells me we might want to gate all of this on
PREEMPT_FULL instead, since PREEMPT_FULL is pretty much a way to say
"I'm willing to give up scheduler tracking accuracy to gain throughput
when I've got a task running alone on a CPU". Thoughts?

Thanks,
Quentin