Steven Rostedt <rostedt@xxxxxxxxxxx> writes:

> On Tue, 7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>> We have two models of preemption: voluntary and full (and RT, which is
>> a fuller form of full preemption). In this series -- which is based
>> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's "NONE", which is really just a "voluntary" with
> fewer preemption points ;-) But it should still be mentioned, otherwise
> people may get confused.
>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we want to:
>> always have the preempt_count enabled and allow the scheduler to drive
>> preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of a higher
>>   sched-class awaits
>> - preemption=full: optimized for low latency. Preempt whenever a
>>   higher-priority task awaits.
>>
>> To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
>> scheduler to mark that a reschedule is needed but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption,
>> as it were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at
>> ret-to-user.
>>
>>              ret-to-user    ret-to-kernel    preempt_count()
>> none              Y               N                N
>> voluntary         Y               Y                Y
>> full              Y               Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
> voluntary         Y               N                N
>
> for LAZY, but
>
> voluntary         Y               Y                Y
>
> for NEED_RESCHED (without lazy)?

Yes, you are of course right. I started out talking about the
TIF_NEED_RESCHED flags and in the middle switched to talking about how
the voluntary model will get what it wants.

> That is, the only difference between voluntary and none (as you
> describe above) is that when an RT task wakes up, on voluntary, it
> sets NEED_RESCHED, but on none, it still sets NEED_RESCHED_LAZY?

Yeah, exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

                       ret-to-user    ret-to-kernel    preempt_count()
NEED_RESCHED_LAZY           Y               N                N
NEED_RESCHED                Y               Y                Y

Based on how the various preemption models set these flags, they would
cause preemption at:

                       ret-to-user    ret-to-kernel    preempt_count()
none                        Y               N                N
voluntary                   Y               Y                Y
full                        Y               Y                Y

>> The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run any "latency sensitive" benchmarks.
>
> I wonder how cyclictest would work under each model with and without
> this patch?
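To make the restated tables concrete, the wakeup-time choice between the
two flags is roughly the following. This is an illustrative sketch only,
not code from the series: the helper name, the enum, and the stand-in
flag values are all invented here for exposition.

/*
 * Sketch: which TIF flag does a wakeup set on the running task?
 * In the kernel, TIF_NEED_RESCHED{,_LAZY} are per-arch thread_info
 * flag bits; the values below are stand-ins so this compiles alone.
 */
#define TIF_NEED_RESCHED	3	/* stand-in value */
#define TIF_NEED_RESCHED_LAZY	4	/* stand-in value */

enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };

/*
 * - full: NEED_RESCHED, acted on at all three points (ret-to-user,
 *   ret-to-kernel, preempt_count() dropping to zero).
 * - voluntary: NEED_RESCHED only when the waking task is of a higher
 *   sched class (e.g. RT); otherwise defer with NEED_RESCHED_LAZY.
 * - none: always NEED_RESCHED_LAZY, i.e. only acted on at ret-to-user.
 */
static int resched_flag(enum preempt_model model, int higher_class)
{
	switch (model) {
	case MODEL_FULL:
		return TIF_NEED_RESCHED;
	case MODEL_VOLUNTARY:
		return higher_class ? TIF_NEED_RESCHED
				    : TIF_NEED_RESCHED_LAZY;
	case MODEL_NONE:
	default:
		return TIF_NEED_RESCHED_LAZY;
	}
}

Under this scheme, none and voluntary differ only in whether a
higher-class wakeup gets the eager flag, which matches the tables above.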
Didn't post these numbers because I suspect that code isn't quite right,
but voluntary preemption, for instance, does what it promises:

 # echo NO_FORCE_PREEMPT > sched/features
 # echo NO_PREEMPT_PRIORITY > sched/features    # preempt=none

 # stress-ng --cyclic 1 --timeout 10
 stress-ng: info: [1214172] setting to a 10 second run per stressor
 stress-ng: info: [1214172] dispatching hogs: 1 cyclic
 stress-ng: info: [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
 stress-ng: info: [1214174] cyclic:   mean: 9834.56 ns, mode: 3495 ns
 stress-ng: info: [1214174] cyclic:   min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
 stress-ng: info: [1214174] cyclic:   latency percentiles:
 stress-ng: info: [1214174] cyclic:   25.00%:       3366 ns
 stress-ng: info: [1214174] cyclic:   50.00%:       3505 ns
 stress-ng: info: [1214174] cyclic:   75.00%:       3776 ns
 stress-ng: info: [1214174] cyclic:   90.00%:       4316 ns
 stress-ng: info: [1214174] cyclic:   95.40%:      10989 ns
 stress-ng: info: [1214174] cyclic:   99.00%:      91181 ns
 stress-ng: info: [1214174] cyclic:   99.50%:     290477 ns
 stress-ng: info: [1214174] cyclic:   99.90%:    1360837 ns
 stress-ng: info: [1214174] cyclic:   99.99%:    3145065 ns
 stress-ng: info: [1214172] successful run completed in 10.00s

 # echo PREEMPT_PRIORITY > features             # preempt=voluntary

 # stress-ng --cyclic 1 --timeout 10
 stress-ng: info: [916483] setting to a 10 second run per stressor
 stress-ng: info: [916483] dispatching hogs: 1 cyclic
 stress-ng: info: [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
 stress-ng: info: [916484] cyclic:   mean: 3682.77 ns, mode: 3185 ns
 stress-ng: info: [916484] cyclic:   min: 2523 ns, max: 150082 ns, std.dev. 2198.07
 stress-ng: info: [916484] cyclic:   latency percentiles:
 stress-ng: info: [916484] cyclic:   25.00%:       3185 ns
 stress-ng: info: [916484] cyclic:   50.00%:       3306 ns
 stress-ng: info: [916484] cyclic:   75.00%:       3666 ns
 stress-ng: info: [916484] cyclic:   90.00%:       4778 ns
 stress-ng: info: [916484] cyclic:   95.40%:       5359 ns
 stress-ng: info: [916484] cyclic:   99.00%:       6141 ns
 stress-ng: info: [916484] cyclic:   99.50%:       7824 ns
 stress-ng: info: [916484] cyclic:   99.90%:      29825 ns
 stress-ng: info: [916484] cyclic:   99.99%:     150082 ns
 stress-ng: info: [916483] successful run completed in 10.01s

This is with a background kernbench half-load.

Let me see if I can dig out the numbers without this series.

--
ankur
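As an aside, the cyclictest comparison suggested above could be driven
the same way as the stress-ng runs. A sketch, not from the original
mail: it assumes debugfs is mounted at /sys/kernel/debug, that the
sched features map to the models as in the runs above (the FORCE_PREEMPT
line for preempt=full is an assumption), and uses only standard
rt-tests cyclictest flags (-m mlockall, -q quiet summary, -p priority,
-i interval in us, -D duration in seconds):

 # cd /sys/kernel/debug/sched

 # echo NO_FORCE_PREEMPT > features
 # echo NO_PREEMPT_PRIORITY > features          # preempt=none
 # cyclictest -m -q -p90 -i200 -D 10

 # echo PREEMPT_PRIORITY > features             # preempt=voluntary
 # cyclictest -m -q -p90 -i200 -D 10

 # echo FORCE_PREEMPT > features                # preempt=full (assumed mapping)
 # cyclictest -m -q -p90 -i200 -D 10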