On (23/10/24 10:34), Steven Rostedt wrote:
> On Tue, 19 Sep 2023 01:42:03 +0200
> Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> > 2) When the scheduler wants to set NEED_RESCHED, it sets
> >    NEED_RESCHED_LAZY instead, which is only evaluated in the
> >    return-to-user-space preemption points.
> >
> >    As NEED_RESCHED_LAZY is not folded into the preemption count, the
> >    preemption count won't become zero, so the task can continue until
> >    it hits return to user space.
> >
> >    That preserves the existing behaviour.
>
> I'm looking into extending this concept to user space and to VMs.
>
> I'm calling this the "extended scheduler time slice" (ESTS, pronounced
> "estis").
>
> The idea is this: have VMs/user space share a memory region with the
> kernel that is per thread/vCPU. This would be registered via a syscall
> or an ioctl on some defined file or whatever. Then, when entering user
> space / the VM, if NEED_RESCHED_LAZY (or whatever it's eventually
> called) is set, the kernel checks whether the thread has this memory
> region and whether a special bit in it is set; if so, it does not
> schedule. It treats it like a long kernel system call.
>
> The kernel will then set another bit in the shared memory region to
> tell user space / the VM that the kernel wanted to schedule but is
> allowing it to finish its critical section. When user space / the VM
> is done with the critical section, it checks the bit that may have
> been set by the kernel, and if it is set, it should do a sched_yield()
> or a VMEXIT so that the kernel can now schedule it.
>
> What about DoS, you say? It's no different from running a long system
> call. No task can run forever. It's not a "preempt disable", it's just
> "give me some more time". A "NEED_RESCHED" will always schedule, just
> like a kernel system call that takes a long time. The goal is to allow
> user space to get out of critical sections that we know can cause
> problems if they get preempted. Usually it's a user space / VM lock
> being held, or maybe a VM interrupt handler that needs to wake up a
> task on another vCPU.
>
> If we are worried about abuse, we could even punish tasks that haven't
> called sched_yield() by the time their extended time slice is used up.
> Even without that punishment, with EEVDF this extension will make the
> task less eligible the next time around.
>
> The goal is to prevent a thread / vCPU from being preempted while
> holding a lock or resource that other threads / vCPUs will want. That
> is, prevent contention, as that's usually the biggest issue with
> performance in user space and VMs.

I think some time ago we tried checking the guest's preempt count on
each vm-exit: we would vm-enter again if the guest had exited from a
critical section (one that bumps the preempt count), so that it could
hopefully finish whatever it was going to do and vm-exit again. We
didn't look into covering the guest's RCU read-side critical sections.

Can you educate me: is your PoC significantly different from the guest
preempt count check?
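For reference, here is roughly how I picture the user-space side of the
handshake you describe. This is a rough sketch only; the struct layout,
the field names and the registration step are my guesses, not anything
from your proposal:

	/*
	 * Hypothetical per-thread region shared with the kernel.
	 * Field names and layout are made up for illustration;
	 * nothing here is a real ABI.
	 */
	#include <sched.h>
	#include <stdatomic.h>

	struct ests_shared {
		/* user space: "in a critical section, please extend me" */
		atomic_int need_extension;
		/* kernel: "I wanted to reschedule but let you finish" */
		atomic_int resched_requested;
	};

	/*
	 * Assume this points at the per-thread region registered via
	 * the (hypothetical) syscall/ioctl mentioned above.
	 */
	static struct ests_shared *ests;

	static inline void ests_critical_enter(void)
	{
		atomic_store_explicit(&ests->need_extension, 1,
				      memory_order_relaxed);
	}

	static inline void ests_critical_exit(void)
	{
		atomic_store_explicit(&ests->need_extension, 0,
				      memory_order_relaxed);

		/* If the kernel deferred a reschedule for us, yield now. */
		if (atomic_exchange_explicit(&ests->resched_requested, 0,
					     memory_order_acq_rel))
			sched_yield();
	}

ests_critical_enter()/ests_critical_exit() would then wrap whatever
user-space lock or handler the thread does not want to be preempted in.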