On Thu, 26 Oct 2023 18:31:56 +0200
Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

> > This feature is a performance boost only, and has nothing to do with
> > "correctness". That's because it has that arbitrary time where it can
> > run a little more. It's more like the difference between having
> > something in cache and a cache miss. This would cause many academics
> > to quit and find a job in sales if they had to prove the correctness
> > of an algorithm that gave you a boost for some random amount of time.
> > The idea here is to help with performance. If it exists, great, your
> > application will likely perform better. If it doesn't, no big deal,
> > you may just have to deal with longer wait times on critical
> > sections.
>
> terminologies, terminologies.... those academic people :-)

I hope this doesn't cause you to quit and switch to a career in sales!

> I think that this can also be seen as an extension of the
> non-preemptive mode to the user space, but... not entirely, it is a
> ceiling to the [ higher than fair/lower than RT ] prior?

Well, it's just an extended time slice of SCHED_OTHER (up to 1 ms at
1000Hz, or 4 ms at 250Hz). But if an RT or DL task were to wake up, it
would preempt this task immediately. This feature is at the whims of
the kernel implementation and provides no guarantees. It's just a hint
from user space asking the kernel for a little more time when the time
slice happens to end while the task is in a critical section. The
kernel is allowed to deny the request.

> and it is not global. It is partitioned: once the section starts, it
> stays there, being preempted by RT/DL?

Basically yes. Looking at the v6.6-rc4 kernel (which is where I
started from), the base time slice is 3ms.

  # cat /sys/kernel/debug/sched/base_slice_ns
  3000000

Note, when I upped this to 6ms, the benefits of this patch did drop.
That makes total sense, because that would drop the number of times the
critical section would be preempted. Extending all time slices does
somewhat the same thing.

With this feature enabled, if the scheduler's time slice ends on a
critical section that has this special bit set, the kernel will give
the task up to 1 more ms (at 1000 HZ) to get out of that section. It
will also tell user space that it is running on extended time by
setting bit 1 (0x2). When user space leaves the critical section, it
should check that bit and, if it is set, make any system call, at
which point the kernel will call schedule. In my example I just used
sched_yield(), but it would work with gettid() as well.

Sure, user space can ignore that bit from the kernel and continue, but
when that 1ms is up, the kernel will preempt the task with prejudice,
regardless of whether it's in a critical section or not. It's in the
task's best interest to make that system call when it knows it's the
best time to do so (not within a critical section). If it does not, it
risks being preempted within a critical section. Not to mention that
the EEVDF scheduler will lower its eligibility for the next round.

-- Steve