On Mon, 3 Feb 2025 09:43:06 +0100 Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> Lazy is not the default, nor even the recommended preemption method at
> this time.

That's OK. If it is expected to become the default in the future, this can wait.

> Lazy will not ever be the only preemption method, full isn't going
> anywhere.

That's fine too, as full preemption has the same issue of preempting tasks that
hold kernel mutexes. Full preemption is for workloads that likely don't want
this feature anyway.

> Lazy only applies to fair (and whatever bpf things end up using
> resched_curr_lazy()).

Is that a problem? User spin locks for RT tasks are very dangerous. If an RT
task preempts a lower priority owner, it can cause a deadlock (if the two tasks
are pinned to the same CPU). BTW, Sebastian mentioned in the Stable RT meeting
that glibc supplies pthread_spin_lock() and the man page says nothing about
this possible scenario.

> Lazy works on tick granularity, which is variable per the HZ config, and
> way too long for any of this nonsense.

Patch 2 changes that to do what you wrote the last time: it has a max wait
time of 50us.

> So by tying this to lazy, you get something that doesn't actually work
> most of the time, and when it works, it has variable and bad behaviour.

Um, no. If we wait for lazy to become the default behavior, it will work most
of the time. And when it does work, it has a strict bound of 50us.

> So yeah, crap.

As your rationale was not correct, I will disagree with this being crap.

> This really isn't difficult to understand, and I've told you this
> before.

And I listened to what you told me before. Patch 2 implements the 50us max
that you suggested. I separated it out because it made the code simpler to
understand and debug. The change log even mentioned:

  For the moment, it lets it run for one more tick (which will be changed later).

That "changed later" is the second patch in this series.

The "this can wait until lazy is default" comment is because we have an
"upstream first" policy. As long as there is some buy-in to the changes, we
can go ahead and implement it on our devices. We do not have to wait for it
to be accepted. But if there's a strong NAK to the idea, it is much harder to
get it implemented internally.

I would also implement a way for user space to know if it is supported or
not. Perhaps have the cr_counter of the rseq initialized to some value that
tells user space the feature is supported in the current configuration of the
kernel? That way there would be "no surprises".

Our current use case is actually for VMs, which requires a slightly different
method. Instead of having the cr_counter that is used for telling the kernel
the task is in a critical section, the rseq would contain a pointer to some
user space memory that holds that counter. The reason is that this memory
would need to be mapped between the VM guest kernel and the VM VCPU emulation
thread. Mathieu did not want to allow exposure of the VM VCPU thread's rseq
structure to the VM guest kernel. Having a separate memory mapping for that is
more secure. Then the raw spin locks of the guest VM kernel could be
implemented using this method as well. We do find performance issues when a
VCPU of a guest kernel is preempted while holding spin locks.

We can focus more on this VM use case, and we could then give better
benchmarks. But again, this depends on whether or not you intend to NAK this
approach altogether.

-- Steve
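
PS: To make the intended user space flow concrete, here is a rough sketch.
The cr_counter name comes from this series, but the bit layout, the helper
names, and the use of sched_yield() below are made up for illustration only;
the final ABI may well look different:

	#include <sched.h>

	/*
	 * Hypothetical bit layout, for illustration only:
	 * one bit set by user space while it holds a user space lock,
	 * and one bit set by the kernel if it delayed a preemption.
	 */
	#define CR_IN_CRIT_SECTION	(1U << 0)	/* set by user space */
	#define CR_PREEMPT_DELAYED	(1U << 1)	/* set by the kernel */

	/* Points into this thread's rseq area; registration code omitted. */
	extern volatile unsigned int *cr_counter;

	static inline void crit_section_enter(void)
	{
		*cr_counter = CR_IN_CRIT_SECTION;
	}

	static inline void crit_section_exit(void)
	{
		unsigned int prev = *cr_counter;

		*cr_counter = 0;

		/*
		 * If the kernel extended our time slice (capped at 50us
		 * by patch 2), give the CPU back right away instead of
		 * keeping the borrowed time.
		 */
		if (prev & CR_PREEMPT_DELAYED)
			sched_yield();
	}

The "no surprises" check mentioned above could then be as simple as user
space reading the initial value of that field after rseq registration and
falling back to plain locking if it reads zero.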