Patch "sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue" has been added to the 6.12-stable tree

This is a note to let you know that I've just added the patch titled

    sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

to the 6.12-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     sched-fair-fix-inaccurate-h_nr_runnable-accounting-w.patch
and it can be found in the queue-6.12 subdirectory.

If you, or anyone else, feel it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 1f0309b88874381903bd24c0abe1a99875124349
Author: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Date:   Fri Jan 17 10:58:52 2025 +0000

    sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
    
    [ Upstream commit 3429dd57f0deb1a602c2624a1dd7c4c11b6c4734 ]
    
    set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
    entity is delayed, irrespective of whether the entity corresponds to a
    task or a cfs_rq.
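
    For reference, the pre-fix set_delayed() has roughly the following
    shape (a sketch reconstructed from the diff at the end of this mail;
    the loop body is inferred from the description above):

        static void set_delayed(struct sched_entity *se)
        {
                se->sched_delayed = 1;

                /*
                 * Pre-fix: the walk runs even when se is a group
                 * entity, i.e. when no task is being dequeued here.
                 */
                for_each_sched_entity(se) {
                        struct cfs_rq *cfs_rq = cfs_rq_of(se);

                        cfs_rq->h_nr_runnable--;
                }
        }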
    
    Consider the following scenario:
    
            root
           /    \
          A      B (*) delayed since B is no longer eligible on root
          |      |
        Task0  Task1 <--- dequeue_task_fair() - task blocks
    
    When Task1 blocks (dequeue_entity() for task's se returns true),
    dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
    hierarchy of Task1. However, when the sched_entity corresponding to
    cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
    hierarchy too, leading to both dequeue_entities() and set_delayed()
    decrementing h_nr_runnable for the dequeue of the same task.
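
    The net effect can be reproduced outside the kernel with a tiny
    user-space model (a hypothetical illustration; the names mirror the
    kernel's counters but none of this is kernel code):

        #include <assert.h>
        #include <stdio.h>

        /* Toy model of root's counter: Task1 runnable under cfs_rq B */
        static int h_nr_runnable = 1;

        /* Pre-fix set_delayed(B): the hierarchy walk decrements root */
        static void set_delayed_buggy(void)
        {
                h_nr_runnable--;
        }

        int main(void)
        {
                /* dequeue_task_fair(Task1): B's se becomes delayed ... */
                set_delayed_buggy();
                /* ... and dequeue_entities() decrements root once more */
                h_nr_runnable--;

                printf("h_nr_runnable = %d\n", h_nr_runnable); /* -1 */
                assert(h_nr_runnable >= 0); /* trips, like SCHED_WARN_ON() */
                return 0;
        }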
    
    A SCHED_WARN_ON() inspecting h_nr_runnable right after its update in
    dequeue_entities(), such as:
    
        cfs_rq->h_nr_runnable -= h_nr_runnable;
        SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
    
    is consistently tripped when running wakeup-intensive workloads like
    hackbench in a cgroup.
    
    This error is self-correcting since cfs_rqs are per-CPU and cannot
    migrate. The entity is either picked for full dequeue or is requeued
    when a task wakes up below it. Both those paths call clear_delayed()
    which again increments h_nr_runnable of the hierarchy without
    considering if the entity corresponds to a task or not.
    
    h_nr_runnable will eventually reflect the correct value; however, in the
    interim, the incorrect values can still influence the PELT calculation, which
    uses se->runnable_weight or cfs_rq->h_nr_runnable.
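
    For group entities, se->runnable_weight is derived from the child
    cfs_rq's h_nr_runnable in se_update_runnable(); a sketch of that
    helper, assuming it matches the post-rename upstream code:

        static inline void se_update_runnable(struct sched_entity *se)
        {
                if (!entity_is_task(se))
                        se->runnable_weight = se->my_q->h_nr_runnable;
        }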
    
    Since only delayed tasks take the early return path in
    dequeue_entities() and enqueue_task_fair(), adjust the
    h_nr_runnable in {set,clear}_delayed() only when a task is delayed, since
    this path skips the h_nr_* update loops and returns early.
    
    For entities corresponding to cfs_rq, the h_nr_* update loop in the
    caller will do the right thing.
    
    Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
    Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@xxxxxxx>
    Tested-by: Swapnil Sapkal <swapnil.sapkal@xxxxxxx>
    Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@xxxxxxx
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65e7be6448720..ddc096d6b0c20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void set_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 1;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * will account it for blocked tasks.
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se)
 static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since a dequeue has
+	 * already accounted for it or an enqueue of a task
+	 * below it will account for it in enqueue_task_fair().
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 



