Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority management)

On 2024-07-12 08:57, Joel Fernandes wrote:
On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
[...]
Existing use cases
-------------------------

- A latency-sensitive workload on the guest might need more than one
time slice to complete, but should not block any higher-priority task
on the host. In our design, the latency-sensitive workload shares its
priority requirements with the host (RT priority, CFS nice value, etc.).
The host implementation of the protocol sets the priority of the vcpu
task accordingly, so that the host scheduler can make an educated
decision about the next task to run. This makes sure that host processes
and vcpu tasks compete fairly for the cpu resource.

AFAIU, the information you need to convey to achieve this is the priority
of the task within the guest. This information needs to reach the host
scheduler so it can make an informed decision.

One thing that is unclear about this is: what is the acceptable
overhead/latency to push this information from guest to host?
Is a hypercall OK, or does it need to be exchanged over a memory
mapping shared between guest and host?

Hypercalls provide simple ABIs across guest/host, and they allow
the guest to immediately notify the host (similar to an interrupt).

A shared memory mapping will require a carefully crafted ABI layout,
and will only allow the host to use the information provided when
the host runs. Therefore, if the choice is to share this information
only through shared memory, the host scheduler will only be able to
read it when it runs, i.e. at the next hypercall, interrupt, and so on.
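For concreteness, a minimal sketch of what such a shared guest-to-host
priority area could look like, using plain C11 atomics. The struct and
field names (pv_sched_area, guest_prio, seq) are hypothetical, not a
proposed ABI; the point is only that the guest can publish a value
without a hypercall, and the host samples it whenever it runs:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-vcpu region shared between guest and host kernels.
 * Field names and layout are illustrative only, not a proposed ABI. */
struct pv_sched_area {
	_Atomic uint32_t guest_prio;	/* priority the guest requests */
	_Atomic uint32_t seq;		/* bumped on every guest update */
};

/* Guest side: publish a new priority without a hypercall. */
static void guest_set_prio(struct pv_sched_area *a, uint32_t prio)
{
	atomic_store_explicit(&a->guest_prio, prio, memory_order_release);
	atomic_fetch_add_explicit(&a->seq, 1, memory_order_release);
}

/* Host side: sample the latest value when the host scheduler runs. */
static uint32_t host_read_prio(struct pv_sched_area *a)
{
	return atomic_load_explicit(&a->guest_prio, memory_order_acquire);
}
```

The release/acquire pairing makes the host see a coherent value, but
nothing here wakes the host up; that is exactly the latency tradeoff
versus a hypercall discussed above.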

- The guest should be able to notify the host that it is running a
lower-priority task so that the host can reschedule it if needed. As
mentioned before, the guest shares the priority with the host and the
host makes a better scheduling decision.

It is unclear to me whether this information needs to be "pushed"
from guest to host (e.g. hypercall) in a way that allows the host
to immediately act on this information, or if it is OK to have the
host read this information when its scheduler happens to run.

- Proactive vcpu boosting for events like interrupt injection.
Depending on the guest for a boost request might be too late, as the vcpu
might not be scheduled to run even after interrupt injection. The host
implementation of the protocol boosts the vcpu task's priority so that
it gets a better chance of being scheduled immediately and the guest can
handle the interrupt with minimal latency. Once the guest is done
handling the interrupt, it can notify the host and lower the priority
of the vcpu task.

This appears to be a scenario where the host sets a "high priority", and
the guest clears it when it is done with the irq handler. I guess it can
be done either way (hypercall or shared memory), but the choice would
depend on the parameters identified above: acceptable overhead vs. acceptable
latency to inform the host scheduler.
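The host-sets/guest-clears handshake could be sketched as a single flag
in a shared per-vcpu area. This is an illustrative sketch, not anything
from the patch series; pv_boost_area and its field are made-up names:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative boost flag in a shared per-vcpu area; names are made up. */
struct pv_boost_area {
	_Atomic uint32_t boosted;	/* 1 while an injected irq is pending */
};

/* Host: boost the vcpu task before injecting an interrupt. */
static void host_boost(struct pv_boost_area *a)
{
	atomic_store(&a->boosted, 1);
}

/* Guest irq-exit path: clear the boost once the handler is done.
 * Returns true if a boost was actually pending, so the guest knows
 * whether there is anything to tell the host about (via hypercall,
 * or by simply letting the host observe the cleared flag the next
 * time its scheduler runs). */
static bool guest_unboost(struct pv_boost_area *a)
{
	return atomic_exchange(&a->boosted, 0) == 1;
}
```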

- Guests which assign specialized tasks to specific vcpus can share
that information with the host so that the host can try to avoid
co-locating those vcpus on a single physical cpu. For example, there are
interrupt-pinning use cases where specific cpus are chosen to handle
critical interrupts, and passing this information to the host could be
useful.

How frequently is this topology expected to change? Is it something that
is set once when the guest starts and then fixed? How often it changes
will likely affect the tradeoffs here.

- Another use case is the sharing of cpu capacity details between
guest and host. Sharing the host cpu's load with the guest will enable
the guest to schedule latency-sensitive tasks on the best possible
vcpu. This could be partially achieved with steal time, but steal time
is more apparent on busy vcpus. There are workloads which are mostly
sleepers but wake up intermittently to serve short latency-sensitive
workloads; input event handlers in Chrome are one such example.

OK, so for this use case the information goes the other way around: from host
to guest. Here the shared mapping seems better than polling the state
through a hypercall.
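Since the host may update its load figure while the guest is reading it,
a seqcount-style protocol (as used widely in the kernel and in the vDSO
clock data) would let the guest read a consistent snapshot without any
hypercall. A minimal sketch, with invented names and an illustrative
load unit:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical host-to-guest capacity info. The sequence counter is
 * odd while a host update is in progress, even when the data is stable,
 * mirroring the kernel's seqcount convention. */
struct pv_cpu_capacity {
	_Atomic uint32_t seq;
	uint32_t load;		/* host pcpu load, illustrative unit */
};

/* Host writer: bracket the update with two counter increments. */
static void host_publish_load(struct pv_cpu_capacity *c, uint32_t load)
{
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_acq_rel); /* odd */
	c->load = load;
	atomic_fetch_add_explicit(&c->seq, 1, memory_order_release); /* even */
}

/* Guest reader: retry until it sees a stable, even sequence count. */
static uint32_t guest_read_load(struct pv_cpu_capacity *c)
{
	uint32_t s1, s2, load;

	do {
		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		load = c->load;
		s2 = atomic_load_explicit(&c->seq, memory_order_acquire);
	} while (s1 != s2 || (s1 & 1));
	return load;
}
```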


Data from the prototype implementation shows promising improvement in
reducing latencies. Data was shared in the v1 cover letter. We have
not implemented the capacity based placement policies yet, but plan to
do that soon and have some real numbers to share.

Ideas brought up during offlist discussion
-------------------------------------------------------

1. rseq based timeslice extension mechanism[1]

While the rseq-based mechanism helps in giving the vcpu task one more
time slice, it will not help in the other use cases. We had a chat
with Steve, and the rseq mechanism was mainly for improving lock
contention and would not work best with vcpu boosting considering all
the use cases above. RT or high-priority tasks in the VM would often
need more than one time slice to complete their work but, at the same
time, should not hurt the host workloads. The goal for the above use
cases is not requesting an extra slice, but modifying the priority in
such a way that host processes and guest processes compete fairly for
cpu resources. This also means that a vcpu task can request a lower
priority when it is running lower-priority tasks in the VM.

I was looking at rseq on request from the KVM call; however, it does not
make sense to me yet how to expose the rseq area via the guest VA to the host
kernel. rseq is for userspace-to-kernel, not VM-to-kernel.

Steven Rostedt said as much as well. Thoughts? Adding Mathieu as well.

I'm not sure that rseq would help at all here, but I think we may want to
borrow concepts of data sitting in shared memory across privilege levels
and apply them to VMs.

If some of the ideas end up being useful *outside* of the context of VMs,
then I'd be willing to consider adding fields to rseq. But as long as it is
VM-specific, I suspect you'd be better off with dedicated per-vcpu pages which
you can safely share across host/guest kernels.
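A dedicated per-vcpu page could be laid out roughly like this; all
struct and field names are illustrative, combining the fields from the
use cases above. Padding the struct to a full page keeps each vcpu's
data on its own page, so it can be mapped into the guest independently
and avoids false sharing between vcpus:

```c
#include <stdint.h>

#define PV_PAGE_SIZE 4096

/* Sketch of a dedicated per-vcpu page shared across host/guest kernels.
 * All fields are illustrative, not a proposed ABI. */
struct pv_vcpu_page {
	uint32_t guest_prio;	/* guest -> host: requested priority */
	uint32_t boosted;	/* host sets, guest clears after irq */
	uint32_t host_load;	/* host -> guest: pcpu load hint */
	uint8_t  pad[PV_PAGE_SIZE - 3 * sizeof(uint32_t)];
};

_Static_assert(sizeof(struct pv_vcpu_page) == PV_PAGE_SIZE,
	       "per-vcpu area must occupy exactly one page");
```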


This idea seems to suffer from the same over-engineering as the vDSO
approach below; rseq does not seem to fit.

Steven Rostedt told me that what we instead need is a tracepoint callback
in a driver that does the boosting.

I utterly dislike changing the system behavior through tracepoints. They were
designed to observe the system, not modify its behavior. If people start abusing
them, then subsystem maintainers will stop adding them. Please don't do that.
Add a notifier or think about integrating what you are planning to add into the
driver instead.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




