On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> >> Existing use cases
> >> -------------------------
> >>
> >> - A latency sensitive workload on the guest might need more than one
> >> time slice to complete, but should not block any higher priority task
> >> in the host. In our design, the latency sensitive workload shares its
> >> priority requirements with the host (RT priority, cfs nice value etc).
> >> The host implementation of the protocol sets the priority of the vcpu
> >> task accordingly so that the host scheduler can make an educated
> >> decision on the next task to run. This makes sure that host processes
> >> and vcpu tasks compete fairly for the cpu resource.
>
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information needs to reach the host
> scheduler to make an informed decision.
>
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?

Shared memory for the boost (the host can act on it later, e.g. when it
preempts the vcpu). But for unboost, we possibly need a hypercall in
addition to the shared memory.

> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).
>
> Shared memory mapping will require a carefully crafted ABI layout,
> and will only allow the host to use the information provided when
> the host runs. Therefore, if the choice is to share this information
> only through shared memory, the host scheduler will only be able to
> read it when it runs, e.g. on a hypercall, interrupt, and so on.

The initial idea was to handle the details (format and allocation) of the
shared memory out of band in a driver, but then the rseq idea came up
later.

> >> - Guest should be able to notify the host that it is running a lower
> >> priority task so that the host can reschedule it if needed. As
> >> mentioned before, the guest shares the priority with the host and the
> >> host takes a better scheduling decision.
>
> It is unclear to me whether this information needs to be "pushed"
> from guest to host (e.g. hypercall) in a way that allows the host
> to immediately act on this information, or if it is OK to have the
> host read this information when its scheduler happens to run.

For boosting, there is no need to push immediately; a push is only needed
on preemption.

> >> - Proactive vcpu boosting for events like interrupt injection.
> >> Depending on the guest for the boost request might be too late, as the
> >> vcpu might not be scheduled to run even after interrupt injection. The
> >> host implementation of the protocol boosts the vcpu task's priority so
> >> that it gets a better chance of being scheduled immediately and the
> >> guest can handle the interrupt with minimal latency. Once the guest is
> >> done handling the interrupt, it can notify the host and lower the
> >> priority of the vcpu task.
>
> This appears to be a scenario where the host sets a "high priority", and
> the guest clears it when it is done with the irq handler. I guess it can
> be done either way (hypercall or shared memory), but the choice would
> depend on the parameters identified above: acceptable overhead vs
> acceptable latency to inform the host scheduler.
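To make the overhead side of that tradeoff concrete, the guest-side
unboost we have been sketching looks roughly like the code below. This is
illustrative only: pv_sched_area, boost_req, need_kick and
pv_sched_unboost_hypercall() are made-up names for this example and are
not part of the posted series.

    #include <linux/percpu.h>
    #include <linux/compiler.h>
    #include <linux/types.h>

    /* Hypothetical per-vcpu area shared between guest and host. */
    struct pv_sched_area {
            __u32 boost_req;   /* non-zero while a boost is requested */
            __u32 need_kick;   /* host wants an explicit unboost call */
    };

    static DEFINE_PER_CPU(struct pv_sched_area *, pv_sched_area);

    /* Hypothetical hypercall wrapper, used only as a slow path here. */
    void pv_sched_unboost_hypercall(void);

    /* Guest irq-exit path: drop the boost lazily through shared memory. */
    static void pv_sched_unboost(void)
    {
            struct pv_sched_area *pa = __this_cpu_read(pv_sched_area);

            WRITE_ONCE(pa->boost_req, 0);

            /* Pay for a hypercall only when the host asked to be kicked. */
            if (READ_ONCE(pa->need_kick))
                    pv_sched_unboost_hypercall();
    }

The host reads boost_req from its scheduling paths, so in the common case
an unboost is a couple of plain stores/loads rather than a vmexit.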
Yes, we have found ways to make fewer hypercalls on the unboost path.

> >> - Guests which assign specialized tasks to specific vcpus can share
> >> that information with the host so that the host can try to avoid
> >> co-locating those vcpus on a single physical cpu. For example, there
> >> are interrupt pinning use cases where specific cpus are chosen to
> >> handle critical interrupts, and passing this information to the host
> >> could be useful.
>
> How frequently is this topology expected to change ? Is it something that
> is set once when the guest starts and then is fixed ? How often it changes
> will likely affect the tradeoffs here.

Yes, it is set once when the guest starts and then stays fixed.

> >> - Another use case is the sharing of cpu capacity details between
> >> guest and host. Sharing the host cpu's load with the guest will enable
> >> the guest to schedule latency sensitive tasks on the best possible
> >> vcpu. This could be partially achievable by steal time, but steal time
> >> is more apparent on busy vcpus. There are workloads which are mostly
> >> sleepers, but wake up intermittently to serve short latency sensitive
> >> workloads. Input event handlers in Chrome are one such example.
>
> OK so for this use-case information goes the other way around: from host
> to guest. Here the shared mapping seems better than polling the state
> through an hypercall.

Yes. FWIW, this particular part is for the future and is not required
initially.

> >> Data from the prototype implementation shows promising improvement in
> >> reducing latencies. Data was shared in the v1 cover letter. We have
> >> not implemented the capacity based placement policies yet, but plan to
> >> do that soon and have some real numbers to share.
> >>
> >> Ideas brought up during offlist discussion
> >> -------------------------------------------------------
> >>
> >> 1. rseq based timeslice extension mechanism[1]
> >>
> >> While the rseq based mechanism helps in giving the vcpu task one more
> >> time slice, it will not help in the other use cases. We had a chat
> >> with Steve and the rseq mechanism was mainly for improving lock
> >> contention and would not work best with vcpu boosting considering all
> >> the use cases above. RT or high priority tasks in the VM would often
> >> need more than one time slice to complete their work and, at the same
> >> time, should not be hurting the host workloads. The goal for the above
> >> use cases is not requesting an extra slice, but to modify the priority
> >> in such a way that host processes and guest processes compete fairly
> >> for cpu resources. This also means that the vcpu task can request a
> >> lower priority when it is running lower priority tasks in the VM.
> >
> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel. rseq is for userspace to kernel, not VM to kernel.
> >
> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
>
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
>
> If some of the ideas end up being useful *outside* of the context of VMs,
> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better off with dedicated per-vcpu pages
> which you can safely share across host/guest kernels.

Yes, this was the initial plan. I also feel rseq cannot be applied here.
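For reference, the kind of dedicated per-vcpu layout we had in mind is
roughly the sketch below. It is purely illustrative: the struct and field
names are made up here, and the real ABI (versioning, padding, where the
page is allocated and mapped) is exactly the part we wanted to keep out of
band in a driver.

    #include <linux/types.h>

    /*
     * Hypothetical per-vcpu page shared between the guest and host
     * kernels; one instance per vcpu, written mostly by the guest.
     */
    struct pv_sched_shared {
            /* guest -> host: what the vcpu is currently running */
            __u32 sched_policy;   /* SCHED_NORMAL, SCHED_FIFO, ... */
            __s32 nice;           /* valid for SCHED_NORMAL */
            __u32 rt_prio;        /* valid for RT policies */
            __u32 boost_req;      /* set/cleared around latency sensitive work */

            /* host -> guest: capacity/load hints (future use) */
            __u32 host_cpu_load;

            __u32 reserved[3];    /* room to extend the ABI */
    };

The host would fold sched_policy/nice/rt_prio into the priority of the
vcpu task when its scheduler runs, which is the part that does not map
naturally onto rseq's userspace-to-kernel model.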
> > This idea seems to suffer from the same vDSO over-engineering below; rseq
> > does not seem to fit.
> >
> > Steven Rostedt told me that what we instead need is a tracepoint callback
> > in a driver that does the boosting.
>
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

Well, we do have "raw" tracepoints that are not accessible from userspace,
so are you saying even those are off limits for adding callbacks?

- Joel