> > Roughly summarizing an off-list discussion.
> >
> >  - Discovering schedulers should be handled outside of KVM and the kernel, e.g.
> >    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
> >
> >  - "Negotiating" features/hooks should also be handled outside of the kernel,
> >    e.g. similar to how VirtIO devices negotiate features between host and guest.
> >
> >  - Pushing PV scheduler entities to KVM should either be done through an exported
> >    API, e.g. if the scheduler is provided by a separate kernel module, or by a
> >    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
> >
> > I think those were the main takeaways?  Vineeth and Joel, please chime in on
> > anything I've missed or misremembered.
> >
> Thanks for the brief about the offlist discussion, all the points are
> captured, just some minor additions. The v2 implementation moved the
> scheduling policies out of kvm to a separate entity called the pvsched
> driver, which could be implemented as a kernel module or bpf program.
> But the handshake between guest and host to decide on what pvsched
> driver to attach was still going through kvm. So it was suggested to
> move this handshake (discovery and negotiation) outside of kvm. The
> idea is to have a virtual device exposed by the VMM which would take
> care of the handshake. Guest driver for this device would talk to the
> device to understand the pvsched details on the host and pass the
> shared memory details. Once the handshake is completed, the device is
> responsible for loading the pvsched driver (bpf program or kernel
> module responsible for implementing the policies). The pvsched driver
> will register to the trace points exported by kvm and handle the
> callbacks from then on. The scheduling will be taken care of by the
> host scheduler; the pvsched driver on the host is responsible only
> for setting the policies (placement, priorities etc).
>
> With the above approach, the only change in kvm would be the internal
> tracepoints for pvsched. Host kernel will also be unchanged and all
> the complexities move to the VMM and the pvsched driver. Guest kernel
> will have a new driver to talk to the virtual pvsched device and this
> driver would hook into the guest kernel for passing scheduling
> information to the host (via tracepoints).
>

Noting down the recent offlist discussion and details of our response.

Based on the previous discussions, we had come up with a modified
design focusing on minimum kvm changes. The design is as follows:

- Guest and host share scheduling information via a shared memory
  region. Details of the layout of the memory region, the information
  shared, and the actions and policies are defined by the pvsched
  protocol. This protocol is implemented by a BPF program or a kernel
  module.

- Host exposes a virtual device (pvsched device) to the guest. This
  device is the mechanism for host and guest to handshake and
  negotiate the pvsched protocol to use. The virtual device is
  implemented in the VMM in userland, as it is not in the performance
  critical path.

- Guest loads a pvsched driver during device enumeration. The driver
  initiates the protocol handshake and negotiation with the host and
  decides on the protocol. This driver creates a per-cpu shared memory
  region and shares the GFN with the device in the host. Guest also
  loads the BPF program that implements the protocol in the guest.

- Once the VMM has all the information needed (per-cpu shared memory
  GFN, vcpu task pids etc), it loads the BPF program which implements
  the protocol on the host.

- BPF program on the host registers to the trace points in kvm to get
  callbacks on events of interest like VMENTER, VMEXIT, interrupt
  injection etc. Similarly, the guest BPF program registers to
  tracepoints in the guest kernel for events of interest like sched
  wakeup, sched switch, enqueue, dequeue, irq entry/exit etc.
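
To make the shared region and the outcome of the handshake a bit more
concrete, here is a minimal C sketch. It is only an illustration under
assumptions: the struct layout, the feature bits and the function names
(pvsched_vcpu_shared, PVSCHED_F_*, pvsched_negotiate,
pvsched_init_shared) are hypothetical and not taken from the patches.
The negotiation is modelled on the virtio style of intersecting the
feature sets offered by each side.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; the real protocol would define its own. */
#define PVSCHED_F_PRIO_HINT  (1u << 0)  /* guest reports task priority  */
#define PVSCHED_F_IRQ_BOOST  (1u << 1)  /* host boosts on irq injection */
#define PVSCHED_F_CAPACITY   (1u << 2)  /* host reports cpu capacity    */

/* Hypothetical per-vcpu record placed in the shared memory region. */
struct pvsched_vcpu_shared {
	uint32_t version;        /* protocol version agreed in handshake */
	uint32_t features;       /* negotiated feature bits              */
	int32_t  guest_prio;     /* written by guest: requested priority */
	uint32_t host_capacity;  /* written by host: capacity hint       */
};

/* Both sides must agree on the size of the record. */
static_assert(sizeof(struct pvsched_vcpu_shared) == 16,
	      "shared record must have a fixed size");

/* Only features offered by both host and guest are enabled. */
static inline uint32_t pvsched_negotiate(uint32_t host_features,
					 uint32_t guest_features)
{
	return host_features & guest_features;
}

/*
 * Fill in one record after the handshake.
 * Returns 0 on success, -1 if the two sides share no feature.
 */
static inline int pvsched_init_shared(struct pvsched_vcpu_shared *s,
				      uint32_t version,
				      uint32_t host_features,
				      uint32_t guest_features)
{
	uint32_t common = pvsched_negotiate(host_features, guest_features);

	if (common == 0)
		return -1;

	s->version = version;
	s->features = common;
	s->guest_prio = 0;
	s->host_capacity = 0;
	return 0;
}
```

In this sketch the handshake itself would still happen through the
virtual device in the VMM; the code only shows what both sides could
end up agreeing on before the BPF programs are loaded.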

The above design is minimally invasive to kvm and the core kernel. It
implements the protocol as loadable programs, and the protocol
handshake and negotiation go through the virtual device framework. The
protocol implementation takes care of information sharing and policy
enforcement, and the scheduler handles the actual scheduling
decisions. A sample policy implementation (boosting for latency
sensitive workloads as an example) could be included in the kernel for
reference.

We had an offlist discussion about the above design and a couple of
ideas were suggested as alternatives. We had taken an action item to
study the feasibility of the alternatives. The rest of the mail lists
the use cases (not exhaustive) and our feasibility investigations.

Existing use cases
------------------

- A latency sensitive workload on the guest might need more than one
  time slice to complete, but should not block any higher priority
  task in the host. In our design, the latency sensitive workload
  shares its priority requirements with the host (RT priority, cfs
  nice value etc). The host implementation of the protocol sets the
  priority of the vcpu task accordingly so that the host scheduler can
  make an educated decision on the next task to run. This makes sure
  that host processes and vcpu tasks compete fairly for the cpu
  resource.

- Guest should be able to notify the host that it is running a lower
  priority task so that the host can reschedule it if needed. As
  mentioned before, the guest shares the priority with the host and
  the host makes a better scheduling decision.

- Proactive vcpu boosting for events like interrupt injection.
  Depending on the guest for the boost request might be too late, as
  the vcpu might not be scheduled to run even after interrupt
  injection. The host implementation of the protocol boosts the vcpu
  task's priority so that it gets a better chance of being scheduled
  immediately and the guest can handle the interrupt with minimal
  latency.
  Once the guest is done handling the interrupt, it can notify the
  host and lower the priority of the vcpu task.

- Guests which assign specialized tasks to specific vcpus can share
  that information with the host so that the host can try to avoid
  colocating those vcpus on a single physical cpu. For example, there
  are interrupt pinning use cases where specific cpus are chosen to
  handle critical interrupts, and passing this information to the host
  could be useful.

- Another use case is the sharing of cpu capacity details between
  guest and host. Sharing the host cpu's load with the guest will
  enable the guest to schedule latency sensitive tasks on the best
  possible vcpu. This could be partially achieved with steal time, but
  steal time is more apparent on busy vcpus. There are workloads which
  are mostly sleepers, but wake up intermittently to serve short
  latency sensitive work. Input event handlers in Chrome are one such
  example.

Data from the prototype implementation shows promising improvement in
reducing latencies. Data was shared in the v1 cover letter. We have
not implemented the capacity based placement policies yet, but plan to
do that soon and have some real numbers to share.

Ideas brought up during offlist discussion
------------------------------------------

1. rseq based timeslice extension mechanism[1]

While the rseq based mechanism helps in giving the vcpu task one more
time slice, it will not help in the other use cases. We had a chat
with Steve, and the rseq mechanism was mainly meant for improving lock
contention and would not work best for vcpu boosting considering all
the use cases above. RT or high priority tasks in the VM would often
need more than one time slice to complete their work and, at the same
time, should not hurt the host workloads. The goal for the above use
cases is not to request an extra slice, but to modify the priority in
such a way that host processes and guest processes compete fairly for
cpu resources.
This also means that the vcpu task can request a lower priority when
it is running lower priority tasks in the VM.

2. vDSO approach

Regarding the vDSO approach, we had a look at it and feel that without
a major redesign of vDSO, it might be difficult to achieve the
requirements. vDSO is currently implemented as a read-only memory
region shared with the processes. For this to work with
virtualization, we would need to map a similar region to the guest and
it has to be read-write. This is more or less what we are also
proposing, but with minimal changes in the core kernel. With the
current design, the shared memory region would be the responsibility
of the virtual pvsched device framework.

Sorry for the long mail. Please have a look and let us know your
thoughts :-)

Thanks,

[1]: https://lore.kernel.org/all/20231025235413.597287e1@xxxxxxxxxxxxxxxxxx/