On Mon, 3 Feb 2025 09:43:06 +0100 Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> Lazy is not the default, nor even the recommended preemption method at
> this time.

That's OK. If it is expected to become the default in the future, this can wait.

> Lazy will not ever be the only preemption method, full isn't going
> anywhere.

That's fine too, as full preemption has the same issue of preempting tasks that
hold kernel mutexes. Full preemption is for workloads that likely don't want
this feature anyway.

> Lazy only applies to fair (and whatever bpf things end up using
> resched_curr_lazy()).

Is that a problem? User spin locks for RT tasks are very dangerous. If an RT
task preempts a lower priority owner, it can cause a deadlock (if the two tasks
are pinned to the same CPU). BTW, Sebastian mentioned in the Stable RT meeting
that glibc supplies pthread_spin_lock() and the man page says nothing about
this possible scenario.

> Lazy works on tick granularity, which is variable per the HZ config, and
> way too long for any of this nonsense.

Patch 2 changes that to do what you wrote the last time: it has a max wait
time of 50us.

> So by tying this to lazy, you get something that doesn't actually work
> most of the time, and when it works, it has variable and bad behaviour.

Um, no. If we wait for lazy to become the default behavior, it will work most
of the time. And when it does work, it has a strict bound of 50us.

> So yeah, crap.

As your rationale was not correct, I will disagree with this being crap.

> This really isn't difficult to understand, and I've told you this
> before.

And I listened to what you told me before. Patch 2 implements the 50us max
that you suggested. I separated it out because it made the code simpler to
understand and debug. The change log even mentioned:

  For the moment, it lets it run for one more tick (which will be changed later).

That "changed later" is the second patch in this series.

The "this can wait until lazy is default" comment is because we have an
"upstream first" policy. As long as there is some buy-in to the changes, we
can go ahead and implement it on our devices. We do not have to wait for it
to be accepted. But if there's a strong NAK to the idea, it is much harder to
get it implemented internally.

I would also implement a way for user space to know if it is supported or
not. Perhaps have the cr_counter of the rseq initialized to some value that
tells user space the feature is supported in the current configuration of the
kernel? That way there would be "no surprises".

Our current use case is actually for VMs, which requires a slightly different
method. Instead of having the cr_counter that is used for telling the kernel
the task is in a critical section, the rseq would contain a pointer to some
user space memory that holds that counter. The reason is that this memory
would need to be mapped between the VM guest kernel and the VM VCPU emulation
thread. Mathieu did not want to allow exposure of the VM VCPU thread's rseq
structure to the VM guest kernel. Having a separate memory mapping for that is
more secure. Then the raw spin locks of the guest VM kernel could be
implemented using this method as well. We do find performance issues when a
VCPU of a guest kernel is preempted while holding spin locks.

We can focus more on this VM use case, and we could then give better
benchmarks. But again, this depends on whether or not you intend to NAK this
approach altogether.

-- Steve
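
PS: To make the intended user space flow concrete, here is a rough sketch.
The cr_counter name comes from this series, but the bit layout, the helper
names, and the use of sched_yield() below are made up for illustration only;
the final ABI may well look different:

	#include <sched.h>

	/*
	 * Hypothetical bit layout, for illustration only:
	 * one bit set by user space while it holds a user space lock,
	 * and one bit set by the kernel if it delayed a preemption.
	 */
	#define CR_IN_CRIT_SECTION	(1U << 0)	/* set by user space */
	#define CR_PREEMPT_DELAYED	(1U << 1)	/* set by the kernel */

	/* Points into this thread's rseq area; registration code omitted. */
	extern volatile unsigned int *cr_counter;

	static inline void crit_section_enter(void)
	{
		*cr_counter = CR_IN_CRIT_SECTION;
	}

	static inline void crit_section_exit(void)
	{
		unsigned int prev = *cr_counter;

		*cr_counter = 0;

		/*
		 * If the kernel extended our time slice (capped at 50us
		 * by patch 2), give the CPU back right away instead of
		 * keeping the borrowed time.
		 */
		if (prev & CR_PREEMPT_DELAYED)
			sched_yield();
	}

The "no surprises" check mentioned above could then be as simple as user
space reading the initial value of that field after rseq registration and
falling back to plain locking if it reads zero.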