On Thu, Mar 28, 2024 at 8:21 AM mingyang.cui <mingyang.cui@xxxxxxxxxx> wrote: > > When rt_mutex_setprio changes a task's scheduling class to RT, > sometimes the task's vruntime is not updated correctly upon > return to the fair class. > Specifically, the following is being observed: > - task has just been created and running for a short time > - task sleep while still in the fair class > - task is boosted to RT via rt_mutex_setprio, which changes > the task to RT and calls check_class_changed. > - check_class_changed leads to detach_task_cfs_rq, at which point > the vruntime_normalized check sees that the task's sum_exec_runtime > is zero, which results in skipping the subtraction of the > rq's min_vruntime from the task's vruntime > - later, when the prio is deboosted and the task is moved back > to the fair class, the fair rq's min_vruntime is added to > the task's vruntime, even though it wasn't subtracted earlier. Just to make sure I am following: since sum_exec_runtime is updated in update_curr(), if the task goes to sleep, shouldn't it be dequeued and thus update_curr() would have been run (and thus sum_exec_runtime would be non-zero)? Maybe in this analysis is the new task being boosted while it is still newly running (instead of sleeping)? > Since the task's vruntime is about double that of other tasks in cfs_rq, > the task to be unable to run for a long time when there are continuous > runnable tasks in cfs_rq. > > The immediate result is inflation of the task's vruntime, giving > it lower priority (starving it if there's enough available work). > The longer-term effect is inflation of all vruntimes because the > task's vruntime becomes the rq's min_vruntime when the higher > priority tasks go idle. That leads to a vicious cycle, where > the vruntime inflation repeatedly doubled. This is an interesting find! I'm curious how you detected the problem, as it might make a good correctness test (which I'm selfishly looking for more of myself :) > The root cause of the problem is that the vruntime_normalized made a > misjudgment. Since the sum_exec_runtime of some tasks that were just > created and run for a short time is zero, the vruntime_normalized > mistakenly thinks that they are tasks that have just been forked. > Therefore, sum_exec_runtime is not subtracted from the vruntime of the > task. > > So, we fix this bug by adding a check condition for newly forked task. > > Signed-off-by: mingyang.cui <mingyang.cui@xxxxxxxxxx> > --- > kernel/sched/fair.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 73a89fbd81be..3d0c14f3731f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11112,7 +11112,7 @@ static inline bool vruntime_normalized(struct task_struct *p) > * - A task which has been woken up by try_to_wake_up() and > * waiting for actually being woken up by sched_ttwu_pending(). > */ > - if (!se->sum_exec_runtime || > + if (!se->sum_exec_runtime && p->state == TASK_NEW || > (p->state == TASK_WAKING && p->sched_remote_wakeup)) > return true; This looks like it was applied against an older tree? The p->state accesses should be under a READ_ONCE (and likely consolidated before the conditional?) Also, I wonder if alternatively a call to update_curr() if (cfs_rq->curr == se) in switched_from_fair() would be good (ie: normalize it on the boost to close the edge case rather than handle it as an expected non-normalized condition)? thanks -john