Re: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice

Steven Rostedt <rostedt@xxxxxxxxxxx> · Wed, 25 Oct 2023 08:54:34 -0400

Peter!

[ After watching Thomas and Paul reply to each other, I figured this is the
  new LMKL greeting. ]

On Wed, 25 Oct 2023 12:29:52 +0200
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:
> 
> > That is, there's this structure for every thread. It's assigned with:
> > 
> > 	fd = open("/sys/kernel/extend_sched", O_RDWR);
> > 	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > 
> > I don't actually like this interface, as it wastes a full page for just two
> > bits :-p  Perhaps it should be a new system call, where it just locks in
> > existing memory from the user application? The requirement is that each
> > thread needs its own bits to play with. It should not be shared with other
> > threads. It could be, as it will not mess up the kernel, but will mess up
> > the application.  
> 
> What was wrong with using rseq?

I didn't want to overload that for something completely different. This is
not a "restartable sequence".

> 
> > Anyway, to tell the kernel to "extend" the time slice if possible because
> > it's in a critical section, we have:
> > 
> >  static void extend(void)
> >  {
> > 	if (!extend_map)
> > 		return;
> > 
> > 	extend_map->flags = 1;
> >  }
> > 
> > And to say that's it's done:
> > 
> >  static void unextend(void)
> >  {
> > 	unsigned long prev;
> > 
> > 	if (!extend_map)
> > 		return;
> > 
> > 	prev = xchg(&extend_map->flags, 0);
> > 	if (prev & 2)
> > 		sched_yield();
> >  }
> > 
> > So, bit 1 is for user space to tell the kernel "please extend me", and bit
> > two is for the kernel to tell user space "OK, I extended you, but call
> > sched_yield() when done".  
> 
> So what if it doesn't ? Can we kill it for not playing nice ?

No, it's no different than a system call running for a long time. You could
set this bit and leave it there for as long as you want, and it should not
affect anything. If you look at what Thomas's PREEMPT_AUTO.patch does, is
that it sets NEED_RESCHED_LAZY at the tick. Without my patch, this will not
schedule right away, but will schedule when going into user space. My patch
will ignore the schedule if NEED_RESCHED_LAZY is set when going into user
space.

With Thomas's patch, if a task is in the kernel for too long, on the next
tick (if I read his code correctly), if NEED_RESCHED_LAZY is still set,
it will then force the schedule. That is, you get two ticks instead of one.
I may have misread the code, but that's what it looks like it does in
update_deadline() in fair.c.

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches#n587

With my patch, the same thing happens but in user space. This does not give
any more power to any task. I don't expect nor want this to be a privilege
operation. It's no different than running a long system call. And EEVDF
should even keep it fair. As if you use an extra tick, it will go against
your eligibility for the next time around.

Note, NEED_RESCHED still schedules. If a RT or DL task were to wake up, it
will immediately preempt this task regardless of that bit being set.

> 
> [ aside from it being bit 0 and bit 1 as you yourself point out, it is
>   also jarring you use a numeral for one and write out the other. ]
> 
> That said, I properly hate all these things, extending a slice doesn't
> reliably work and we're always left with people demanding an ever longer
> extension.

We could possibly make it adjustable. I'm guessing that will happen anyway
with Thomas's patch. Anyway, my test shows that it makes a huge improvement
for user space implemented spin locks, which I tailored this after how
Postgresql does their spin locks. That is, this is a real world use case. I
plan to implement this in Postgresql and see what improvements it makes in
their tests.

I also plan on testing VMs.

> 
> The *much* better heuristic is what the kernel uses, don't spin if the
> lock holder isn't running.

No it is not. That is a completely useless heuristic for this use case.
That's for waiters and I would guess would make no difference in my test.
The point of this patch is to keep the lock holder running not the waiter
spinning. The reason for the improvement in my test is that the lock was
always held for a very short time and when the time slice came up while the
task was holding the lock, it was able to get it extended to release it, and
then schedule.

Without my patch, you get a several hundreds of this:

    extend-sched-3773  [000]  9628.573272: print:                tracing_mark_write: Have lock!
    extend-sched-3773  [000]  9628.573278: sched_switch:         extend-sched:3773 [120] R ==> mysqld:1216 [120]
          mysqld-1216  [000]  9628.573286: sched_switch:         mysqld:1216 [120] S ==> extend-sched:3773 [120]
    extend-sched-3773  [000]  9628.573287: print:                tracing_mark_write: released lock!

[ Ironically, this example is preempted by mysqld ]

With my patch, there was only a single instance during the run.

When a lock holder schedules out, it greatly increases contention on that
lock. That's the entire reason Thomas implemented NEED_RESCHED_LAZY in the
first place. The aggressive preemption in PREEMPT_RT caused a lot more
contention on spin lock turned mutexes. My patch is to do the exact same
thing for user space implement spin locks, which also includes spin locks
in VM kernels. Adaptive spin locks (spin on owner running) helped
PREEMPT_RT for waiters, but that did nothing to help the lock holder being
preempted, and why NEED_RESCHED_LAZY was still needed even when the kernel
already had adaptive spinners.

The reason I've been told over the last few decades of why people implement
100% user space spin locks is because the overhead of going int the kernel
is way too high. Sleeping is much worse (but that is where the adaptive
spinning comes in, which is a separate issue).

Allowing user space to say "hey, give me a few more microseconds and I'm
fine being preempted" is a very good heuristic. And a way for the kernel to
say, "hey, I gave it to you, you better go into the kernel when you can,
otherwise I'll preempt you no matter what!"

-- Steve