On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> 2) When the scheduler wants to set NEED_RESCHED due it sets
> NEED_RESCHED_LAZY instead which is only evaluated in the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.

I'm looking into extending this concept to user space and to VMs.

I'm calling this the "extended scheduler time slice" (ESTS, pronounced
"estis").

The idea is this. Have VMs / user space share a memory region with the
kernel that is per thread / vCPU. This would be registered via a syscall,
or an ioctl on some defined file, or whatever. Then, when entering user
space / VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is
set, the kernel checks whether the thread has this memory region and
whether a special bit in it is set. If so, it does not schedule. It treats
it like a long kernel system call.

The kernel will then set another bit in the shared memory region that tells
user space / VM that the kernel wanted to schedule, but is allowing it to
finish its critical section. When user space / VM is done with the critical
section, it checks the bit that may have been set by the kernel, and if it
is set, it should do a sched_yield() or VMEXIT so that the kernel can now
schedule it.

What about DoS, you say? It's no different than running a long system call.
No task can run forever. It's not a "preempt disable", it's just "give me
some more time". A "NEED_RESCHED" will always schedule, just like it does
for a kernel system call that takes a long time. The goal is to allow user
space to get out of critical sections that we know can cause problems if
they get preempted. Usually that means a user space / VM lock is held, or
maybe a VM interrupt handler needs to wake up a task on another vCPU.

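To make the handshake above a bit more concrete, here is a rough model of
it in plain C. This is only a sketch under assumptions: every name in it
(struct ests_region, ESTS_WANT_EXTEND, ESTS_RESCHED_REQ, the ests_*()
helpers) is made up for illustration, no such interface exists today, and
the kernel side is simulated by an ordinary function rather than real
return-to-user code.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bits in the per-thread / per-vCPU shared region. */
#define ESTS_WANT_EXTEND (1u << 0) /* set by user space: in a critical section */
#define ESTS_RESCHED_REQ (1u << 1) /* set by kernel: a reschedule was deferred */

/* Hypothetical layout of the shared region. */
struct ests_region {
	_Atomic unsigned int flags;
};

/*
 * Model of the decision made on return to user space / VM entry.
 * Returns true if the task should be scheduled out now.
 */
static bool ests_should_schedule(struct ests_region *r, bool need_resched,
				 bool need_resched_lazy)
{
	if (need_resched)
		return true;	/* a real NEED_RESCHED always schedules */
	if (!need_resched_lazy)
		return false;	/* nothing pending */
	if (r && (atomic_load(&r->flags) & ESTS_WANT_EXTEND)) {
		/* Grant the extension, but tell user space we wanted to schedule. */
		atomic_fetch_or(&r->flags, ESTS_RESCHED_REQ);
		return false;
	}
	return true;		/* no region or no request: schedule as usual */
}

/* User space: ask for more time before entering a critical section. */
static void ests_enter(struct ests_region *r)
{
	atomic_fetch_or(&r->flags, ESTS_WANT_EXTEND);
}

/*
 * User space: leave the critical section. Both bits are cleared in one
 * atomic operation so a deferred request cannot be lost. Returns true
 * if the kernel had deferred a reschedule and we yielded.
 */
static bool ests_exit(struct ests_region *r)
{
	unsigned int old = atomic_fetch_and(&r->flags,
			~(ESTS_WANT_EXTEND | ESTS_RESCHED_REQ));

	if (old & ESTS_RESCHED_REQ) {
		sched_yield();	/* a vCPU would do a VMEXIT here instead */
		return true;
	}
	return false;
}
```

A lock implementation would call ests_enter() right before taking the lock
and ests_exit() right after dropping it, so the fast path costs two atomic
operations and no syscall unless the kernel actually deferred a reschedule.
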
If we are worried about abuse, we could even punish tasks that don't call
sched_yield() by the time their extended time slice runs out. Even without
that punishment, if we have EEVDF, this extension will make the task less
eligible the next time around.

The goal is to prevent a thread / vCPU from being preempted while holding a
lock or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest performance issue in user space
and VMs.

I'm going to work on a POC, and see if I can get some benchmarks on how
much this could help workloads like databases and VMs in general.

-- Steve