On Sat, 1 Feb 2025 19:11:29 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> > 
> > 
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > 
> > >I still have full hate for this approach.
> > 
> > So what approach would you prefer?
> 
> The one that does not rely on the preemption method -- I think I posted
> something along those line, and someone else recently reposted something
> bsaed on it.
> 
> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

How exactly is it "bad design"? Changing the preemption method itself
changes the way applications are scheduled, and that can be very
noticeable to the applications themselves. With no preemption,
applications will see high latency every time any application does a
system call. Voluntary preemption is a little more reactive, but the
preemption points are more random. The preempt lazy kconfig has:

    This option provides a scheduler driven preemption model that is
    fundamentally similar to full preemption, but is less eager to
    preempt SCHED_NORMAL tasks in an attempt to reduce lock holder
    preemption and recover some of the performance gains seen from
    using Voluntary preemption.

This could be a config option called PREEMPT_USER_LAZY that extends the
"reduce lock holder preemption" to user space spin locks.

But if your issue is with relying on the preemption method, does that
mean you would prefer to have this feature for any preemption method?
That may still require using the LAZY flag, which can cause a schedule
in the kernel but not in user space?

Note, my group is actually more interested in implementing this for VMs.
But that requires another level of indirection for the pointers. That
is, qemu could create a device that shares memory between the guest
kernel and the qemu VCPU thread. The guest kernel could update the
counter in this shared memory before grabbing a raw_spin_lock, which
would act just like this patch set does.

The difference would be that the counter would need to live in a memory
page that holds only this information and not the rseq structure
itself. Mathieu was concerned about leaks and corruption of the rseq
structure by a malicious guest. Thus, the counter would have to be in a
clean memory page that is shared between the guest and the qemu thread.
The rseq would then have a pointer to this memory, and the host kernel
would have to follow that pointer to the location of the counter.

In other words, my real goal is to have this working for guests and
their raw_spin_locks. We first tried to do this in KVM directly, but the
KVM maintainers said this is more a generic scheduling issue and doesn't
belong in KVM. I agreed with them.

-- Steve
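
[Editor's note: for readers unfamiliar with the mechanism being debated,
here is a minimal sketch of the counter idea described above, assuming
the hint is a single word in memory that the scheduler side knows how to
locate (the rseq area in the user-space case, or a dedicated shared page
in the VM case). The variable and function names below are illustrative
only and do not reflect the actual patch set's ABI.]

/*
 * Illustrative sketch only: bump a counter before entering a critical
 * section so the scheduler can briefly defer preemption while it is
 * held. Names here are made up for the example.
 */
#include <stdatomic.h>

/* Would really live in the rseq area or in a guest/host shared page. */
static _Atomic unsigned int preempt_hint;

static inline void critical_section_enter(void)
{
	/* Non-zero tells the scheduler "please defer preemption briefly". */
	atomic_fetch_add_explicit(&preempt_hint, 1, memory_order_relaxed);
}

static inline void critical_section_exit(void)
{
	atomic_fetch_sub_explicit(&preempt_hint, 1, memory_order_relaxed);
	/*
	 * A real implementation would also check a "need to schedule"
	 * flag set by the kernel here and yield if it is set.
	 */
}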