On Thu, 26 Oct 2023 18:31:56 +0200
Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

> > This feature is a performance boost only, and has nothing to do with
> > "correctness". That's because it has that arbitrary time where it can
> > run a little more. It's more like the difference between having
> > something in cache and a cache miss. This would cause many academics
> > to quit and find a job in sales if they had to prove the correctness
> > of an algorithm that gave you a boost for some random amount of time.
> > The idea here is to help with performance. If it exists, great, your
> > application will likely perform better. If it doesn't, no big deal,
> > you may just have to deal with longer wait times on critical
> > sections.
>
> terminologies, terminologies.... those academic people :-)

I hope this doesn't cause you to quit and switch to a career in sales!

> I think that this can also be seen as an extension of the
> non-preemptive mode to the user space, but... not entirely, it is a
> ceiling to the [ higher than fair/lower than RT ] prior?

Well, it's just an extended time slice of SCHED_OTHER (up to 1 ms at
1000Hz, or 4 ms at 250Hz). But if an RT or DL task were to wake up, it
would preempt this task immediately. This feature is at the whims of
the kernel implementation and provides no guarantees. It's just a hint
from user space asking the kernel for a little more time when the time
slice happens to end while the task is in a critical section. The
kernel is allowed to deny the request.

> and it is not global. It is partitioned: once the section starts, it
> stays there, being preempted by RT/DL?

Basically yes. Looking at the v6.6-rc4 kernel (which is where I
started from), the base time slice is 3ms.

  # cat /sys/kernel/debug/sched/base_slice_ns
  3000000

Note, when I upped this to 6ms, the benefits of this patch did drop.
That makes total sense, because that would drop the number of times the
critical section would be preempted. Extending all time slices does
somewhat the same thing.

With this feature enabled, if the scheduler's time slice ends on a
critical section that has this special bit set, the kernel will give
the task up to 1 more ms (at 1000 HZ) to get out of that section. It
will also tell user space that it is running on extended time by
setting bit 1 (0x2). When user space leaves the critical section, it
should check that bit and, if it is set, make any system call, at
which point the kernel will call schedule. In my example I just used
sched_yield(), but it would work with gettid() as well.

Sure, user space can ignore that bit from the kernel and continue, but
when that 1ms is up, the kernel will preempt the task with prejudice,
regardless of whether it's in a critical section or not. It's in the
task's best interest to make that system call when it knows it's the
best time to do so (not within a critical section). If it does not, it
risks being preempted within a critical section. Not to mention that
the EEVDF scheduler will lower its eligibility for the next round.

-- Steve