On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> >> Existing use cases
> >> -------------------------
> >>
> >> - A latency sensitive workload on the guest might need more than one
> >> time slice to complete, but should not block any higher priority task
> >> in the host. In our design, the latency sensitive workload shares its
> >> priority requirements with the host (RT priority, cfs nice value etc).
> >> The host implementation of the protocol sets the priority of the vcpu
> >> task accordingly so that the host scheduler can make an educated
> >> decision on the next task to run. This makes sure that host processes
> >> and vcpu tasks compete fairly for the cpu resource.
>
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information needs to reach the host
> scheduler to make an informed decision.
>
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?

Shared memory for the boost (the host can act on it later, e.g. when it
preempts the vcpu). But for unboost, we possibly need a hypercall in
addition to the shared memory.

> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).
>
> Shared memory mapping will require a carefully crafted ABI layout,
> and will only allow the host to use the information provided when
> the host runs. Therefore, if the choice is to share this information
> only through shared memory, the host scheduler will only be able to
> read it when it runs, e.g. on a hypercall, interrupt, and so on.

The initial idea was to handle the details (format and allocation) of the
shared memory out of band in a driver, but then the rseq idea came up
later.

> >> - Guest should be able to notify the host that it is running a lower
> >> priority task so that the host can reschedule it if needed. As
> >> mentioned before, the guest shares the priority with the host and the
> >> host takes a better scheduling decision.
>
> It is unclear to me whether this information needs to be "pushed"
> from guest to host (e.g. hypercall) in a way that allows the host
> to immediately act on this information, or if it is OK to have the
> host read this information when its scheduler happens to run.

For boosting, there is no need to push immediately; a push is only needed
on preemption.

> >> - Proactive vcpu boosting for events like interrupt injection.
> >> Depending on the guest for the boost request might be too late, as the
> >> vcpu might not be scheduled to run even after interrupt injection. The
> >> host implementation of the protocol boosts the vcpu task's priority so
> >> that it gets a better chance of being scheduled immediately and the
> >> guest can handle the interrupt with minimal latency. Once the guest is
> >> done handling the interrupt, it can notify the host and lower the
> >> priority of the vcpu task.
>
> This appears to be a scenario where the host sets a "high priority", and
> the guest clears it when it is done with the irq handler. I guess it can
> be done either way (hypercall or shared memory), but the choice would
> depend on the parameters identified above: acceptable overhead vs
> acceptable latency to inform the host scheduler.
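To make the overhead side of that tradeoff concrete, the guest-side
unboost we have been sketching looks roughly like the code below. This is
illustrative only: pv_sched_area, boost_req, need_kick and
pv_sched_unboost_hypercall() are made-up names for this example and are
not part of the posted series.

    #include <linux/percpu.h>
    #include <linux/compiler.h>
    #include <linux/types.h>

    /* Hypothetical per-vcpu area shared between guest and host. */
    struct pv_sched_area {
            __u32 boost_req;   /* non-zero while a boost is requested */
            __u32 need_kick;   /* host wants an explicit unboost call */
    };

    static DEFINE_PER_CPU(struct pv_sched_area *, pv_sched_area);

    /* Hypothetical hypercall wrapper, used only as a slow path here. */
    void pv_sched_unboost_hypercall(void);

    /* Guest irq-exit path: drop the boost lazily through shared memory. */
    static void pv_sched_unboost(void)
    {
            struct pv_sched_area *pa = __this_cpu_read(pv_sched_area);

            WRITE_ONCE(pa->boost_req, 0);

            /* Pay for a hypercall only when the host asked to be kicked. */
            if (READ_ONCE(pa->need_kick))
                    pv_sched_unboost_hypercall();
    }

The host reads boost_req from its scheduling paths, so in the common case
an unboost is a couple of plain stores/loads rather than a vmexit.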
Yes, we have found ways to make fewer hypercalls on the unboost path.

> >> - Guests which assign specialized tasks to specific vcpus can share
> >> that information with the host so that the host can try to avoid
> >> co-locating those vcpus on a single physical cpu. For example, there
> >> are interrupt pinning use cases where specific cpus are chosen to
> >> handle critical interrupts, and passing this information to the host
> >> could be useful.
>
> How frequently is this topology expected to change ? Is it something that
> is set once when the guest starts and then is fixed ? How often it changes
> will likely affect the tradeoffs here.

Yes, it is set once when the guest starts and then stays fixed.

> >> - Another use case is the sharing of cpu capacity details between
> >> guest and host. Sharing the host cpu's load with the guest will enable
> >> the guest to schedule latency sensitive tasks on the best possible
> >> vcpu. This could be partially achievable by steal time, but steal time
> >> is more apparent on busy vcpus. There are workloads which are mostly
> >> sleepers, but wake up intermittently to serve short latency sensitive
> >> workloads. Input event handlers in Chrome are one such example.
>
> OK so for this use-case information goes the other way around: from host
> to guest. Here the shared mapping seems better than polling the state
> through an hypercall.

Yes. FWIW, this particular part is for the future and is not required
initially.

> >> Data from the prototype implementation shows promising improvement in
> >> reducing latencies. Data was shared in the v1 cover letter. We have
> >> not implemented the capacity based placement policies yet, but plan to
> >> do that soon and have some real numbers to share.
> >>
> >> Ideas brought up during offlist discussion
> >> -------------------------------------------------------
> >>
> >> 1. rseq based timeslice extension mechanism[1]
> >>
> >> While the rseq based mechanism helps in giving the vcpu task one more
> >> time slice, it will not help in the other use cases. We had a chat
> >> with Steve and the rseq mechanism was mainly for improving lock
> >> contention and would not work best with vcpu boosting considering all
> >> the use cases above. RT or high priority tasks in the VM would often
> >> need more than one time slice to complete their work and, at the same
> >> time, should not be hurting the host workloads. The goal for the above
> >> use cases is not requesting an extra slice, but to modify the priority
> >> in such a way that host processes and guest processes compete fairly
> >> for cpu resources. This also means that the vcpu task can request a
> >> lower priority when it is running lower priority tasks in the VM.
> >
> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel. rseq is for userspace to kernel, not VM to kernel.
> >
> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
>
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
>
> If some of the ideas end up being useful *outside* of the context of VMs,
> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better off with dedicated per-vcpu pages
> which you can safely share across host/guest kernels.

Yes, this was the initial plan. I also feel rseq cannot be applied here.
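For reference, the kind of dedicated per-vcpu layout we had in mind is
roughly the sketch below. It is purely illustrative: the struct and field
names are made up here, and the real ABI (versioning, padding, where the
page is allocated and mapped) is exactly the part we wanted to keep out of
band in a driver.

    #include <linux/types.h>

    /*
     * Hypothetical per-vcpu page shared between the guest and host
     * kernels; one instance per vcpu, written mostly by the guest.
     */
    struct pv_sched_shared {
            /* guest -> host: what the vcpu is currently running */
            __u32 sched_policy;   /* SCHED_NORMAL, SCHED_FIFO, ... */
            __s32 nice;           /* valid for SCHED_NORMAL */
            __u32 rt_prio;        /* valid for RT policies */
            __u32 boost_req;      /* set/cleared around latency sensitive work */

            /* host -> guest: capacity/load hints (future use) */
            __u32 host_cpu_load;

            __u32 reserved[3];    /* room to extend the ABI */
    };

The host would fold sched_policy/nice/rt_prio into the priority of the
vcpu task when its scheduler runs, which is the part that does not map
naturally onto rseq's userspace-to-kernel model.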
> > This idea seems to suffer from the same vDSO over-engineering below; rseq
> > does not seem to fit.
> >
> > Steven Rostedt told me that what we instead need is a tracepoint callback
> > in a driver that does the boosting.
>
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

Well, we do have "raw" tracepoints that are not accessible from userspace,
so are you saying even those are off limits for adding callbacks?

- Joel