On Sat, 1 Feb 2025 19:11:29 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> > 
> > 
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > 
> > >I still have full hate for this approach.
> > 
> > So what approach would you prefer?
> 
> The one that does not rely on the preemption method -- I think I posted
> something along those line, and someone else recently reposted something
> bsaed on it.
> 
> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

How exactly is it "bad design"? Changing the preemption method itself
changes the way applications are scheduled, and that can be very
noticeable to the applications themselves. With no preemption,
applications will see high latency every time any application does a
system call. Voluntary preemption is a little more reactive, but the
preemption points are more random. The preempt lazy kconfig has:

    This option provides a scheduler driven preemption model that is
    fundamentally similar to full preemption, but is less eager to
    preempt SCHED_NORMAL tasks in an attempt to reduce lock holder
    preemption and recover some of the performance gains seen from
    using Voluntary preemption.

This could be a config option called PREEMPT_USER_LAZY that extends the
"reduce lock holder preemption" to user space spin locks.

But if your issue is with relying on the preemption method, does that
mean you would prefer to have this feature for any preemption method?
That may still require using the LAZY flag, which can cause a schedule
in the kernel but not in user space?

Note, my group is actually more interested in implementing this for VMs.
But that requires another level of indirection for the pointers. That
is, qemu could create a device that shares memory between the guest
kernel and the qemu VCPU thread. The guest kernel could update the
counter in this shared memory before grabbing a raw_spin_lock, which
would act just like this patch set does.

The difference would be that the counter would need to live in a memory
page that holds only this information and not the rseq structure
itself. Mathieu was concerned about leaks and corruption of the rseq
structure by a malicious guest. Thus, the counter would have to be in a
clean memory page that is shared between the guest and the qemu thread.
The rseq would then have a pointer to this memory, and the host kernel
would have to follow that pointer to the location of the counter.

In other words, my real goal is to have this working for guests and
their raw_spin_locks. We first tried to do this in KVM directly, but the
KVM maintainers said this is more a generic scheduling issue and doesn't
belong in KVM. I agreed with them.

-- Steve
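
[Editor's note: for readers unfamiliar with the mechanism being debated,
here is a minimal sketch of the counter idea described above, assuming
the hint is a single word in memory that the scheduler side knows how to
locate (the rseq area in the user-space case, or a dedicated shared page
in the VM case). The variable and function names below are illustrative
only and do not reflect the actual patch set's ABI.]

/*
 * Illustrative sketch only: bump a counter before entering a critical
 * section so the scheduler can briefly defer preemption while it is
 * held. Names here are made up for the example.
 */
#include <stdatomic.h>

/* Would really live in the rseq area or in a guest/host shared page. */
static _Atomic unsigned int preempt_hint;

static inline void critical_section_enter(void)
{
	/* Non-zero tells the scheduler "please defer preemption briefly". */
	atomic_fetch_add_explicit(&preempt_hint, 1, memory_order_relaxed);
}

static inline void critical_section_exit(void)
{
	atomic_fetch_sub_explicit(&preempt_hint, 1, memory_order_relaxed);
	/*
	 * A real implementation would also check a "need to schedule"
	 * flag set by the kernel here and yield if it is set.
	 */
}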