On Fri, 12 Jul 2024 11:32:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> >>> I was looking at the rseq on request from the KVM call, however it does not
> >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> >>> kernel. rseq is for userspace to kernel, not VM to kernel.
> >
> > Any memory that is exposed to host userspace can be exposed to the guest. Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> >
> > At that point, the vCPU can read/write the rseq structure directly.

So basically, the vCPU thread can just create a virtio device that exposes
the rseq memory to the guest kernel?

One other issue we need to worry about is that IIUC rseq memory is allocated
by the guest/user, not the host kernel. This means it can be swapped out.
The code that handles this needs to be able to handle user page faults.

(For illustration, rough sketches of both the memslot overlay Sean describes
and a non-faulting scheduler-side read are appended at the end of this mail.)

>
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).
> But this is something we can improve upon once we understand what we
> are trying to achieve.
>
> >
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task. So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> >
> [...]
>
> OK, so how about we expose "offsets" tuning the base values ?
>
> - The task doing KVM_RUN, just like any other task, has its "priority"
>   value as set by setpriority(2).
>
> - We introduce two new fields in the per-thread struct rseq, which is
>   mapped in the host task doing KVM_RUN and readable from the scheduler:
>
>   - __s32 prio_offset;  /* Priority offset to apply on the current task priority. */
>
>   - __u64 vcpu_sched;   /* Pointer to a struct vcpu_sched in user-space */
>
> vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
> which would be typically NULL except for tasks doing KVM_RUN. This would
> sit in its own pages per vcpu, which takes care of isolation between guest
> kernel and host process. Those would be RW by the guest kernel as
> well and contain e.g.:

Hmm, maybe not make this only vcpu specific, but perhaps this can be useful
for user space tasks that want to dynamically change their priority without
a system call. It could do the same thing.

Yeah, yeah, I may be coming up with a solution in search of a problem ;-)

-- Steve


>
> struct vcpu_sched {
>         __u32 len;  /* Length of active fields. */
>
>         __s32 prio_offset;
>         __s32 cpu_capacity_offset;
>         [...]
> };
>
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
>
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
>
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
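
For illustration, here is a rough sketch of the "overlay" mapping Sean
describes: host userspace (the VMM) points a KVM memslot at the page backing
the vCPU thread's rseq area, so the guest can reach it at a known GPA. The
map_rseq_overlay() name and the vm_fd/slot_id/rseq_gpa parameters are made up
for the example, and how the guest learns the GPA is left open. Note that the
real rseq area normally lives in the thread's TLS, sharing a page with other
data, which is part of the isolation concern raised above; the sketch only
shows the mapping mechanics.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch only: map the page backing this vCPU thread's rseq area into the
 * guest at a caller-chosen GPA via a KVM memslot.  vm_fd, slot_id and
 * rseq_gpa are hypothetical parameters supplied by the VMM; rseq_page must
 * be the page-aligned host address of the page holding the rseq structure.
 */
static int map_rseq_overlay(int vm_fd, uint32_t slot_id,
                            uint64_t rseq_gpa, void *rseq_page)
{
        struct kvm_userspace_memory_region region;

        memset(&region, 0, sizeof(region));
        region.slot = slot_id;
        region.guest_phys_addr = rseq_gpa;
        region.memory_size = 4096;              /* one page */
        region.userspace_addr = (uintptr_t)rseq_page;

        /* Creates the memslot overlaying guest RAM at rseq_gpa. */
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

The vCPU thread would do this once at setup; after that the guest kernel can
read/write the page directly, as in Sean's description.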
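
And a rough sketch of the scheduler-side read, assuming struct rseq grew the
proposed prio_offset and vcpu_sched fields (they are only a proposal, so this
does not build against current kernels, and task_effective_prio() is a
made-up helper). Using the existing copy_from_user_nofault() also covers the
swap-out concern: if the user page is not resident the read fails and we fall
back to the unmodified task priority instead of taking a page fault from
scheduler context.

#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/rseq.h>

/* Proposed layout from the discussion above -- not existing UAPI. */
struct vcpu_sched {
        __u32 len;                      /* Length of active fields. */
        __s32 prio_offset;              /* Written by the guest kernel. */
        __s32 cpu_capacity_offset;      /* Written by the host kernel. */
};

/*
 * Sketch: fold the user-supplied offsets into a task's priority.  Both the
 * rseq area and the vcpu_sched page are user memory and may be swapped
 * out, so only non-faulting accesses are used; any failure falls back to
 * the base priority.
 */
static int task_effective_prio(struct task_struct *p, int base_prio)
{
        struct rseq __user *urseq = p->rseq;
        struct vcpu_sched __user *vs;
        u64 vs_ptr = 0;
        s32 off = 0, voff = 0;

        if (!urseq)
                return base_prio;

        /* ->prio_offset and ->vcpu_sched are the proposed new rseq fields. */
        if (copy_from_user_nofault(&off, &urseq->prio_offset, sizeof(off)))
                return base_prio;

        if (!copy_from_user_nofault(&vs_ptr, &urseq->vcpu_sched, sizeof(vs_ptr)) &&
            vs_ptr) {
                vs = (struct vcpu_sched __user *)(unsigned long)vs_ptr;
                if (copy_from_user_nofault(&voff, &vs->prio_offset, sizeof(voff)))
                        voff = 0;
        }

        /* How to bound these offsets (setrlimit(2)?) is still open. */
        return base_prio + off + voff;
}

Nothing here is vcpu specific, which matches the point above: any task could
update its own prio_offset without a system call.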