The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     f5aaff7bfa11fb0b2ee6b8fd7bbc16cfceea2ad3
Gitweb:        https://git.kernel.org/tip/f5aaff7bfa11fb0b2ee6b8fd7bbc16cfceea2ad3
Author:        Peter Zijlstra <peterz@xxxxxxxxxxxxx>
AuthorDate:    Thu, 10 Oct 2024 08:28:36
Committer:     Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CommitterDate: Fri, 11 Oct 2024 10:49:33 +02:00

sched/core: Dequeue PSI signals for blocked tasks that are delayed

psi_dequeue() for a blocked task expects psi_sched_switch() to clear
the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flag; however,
psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect whether
the task is blocked or still runnable, which is no longer true with
DELAY_DEQUEUE, since a blocking task can be left queued on the
runqueue. This can lead to PSI splats similar to:

    psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4

when the task is requeued, since the TSK_RUNNING flag was not cleared
when the task blocked.

Explicitly communicate to psi_sched_switch() that the task was
blocked, even if its dequeue was delayed and it is still on the
runqueue.

[ prateek: Broke off the relevant part from [1], commit message ]

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@xxxxxx/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@xxxxxxxxxxxxx/
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Not-yet-signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Tested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Link: https://lore.kernel.org/lkml/20241004123506.GR18071@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ [1]
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a860996..9e09140 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
 	 * as a preemption by schedule_debug() and RCU.
 	 */
 	bool preempt = sched_mode > SM_NONE;
+	bool block = false;
 	unsigned long *switch_count;
 	unsigned long prev_state;
 	struct rq_flags rf;
@@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
 			 * After this, schedule() must not care about p->state any more.
 			 */
 			block_task(rq, prev, flags);
+			block = true;
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -6667,7 +6669,7 @@ picked:
 	migrate_disable_switch(rq, prev);
 	psi_account_irqtime(rq, prev, next);

-	psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+	psi_sched_switch(prev, next, block);

	trace_sched_switch(preempt, prev, next, prev_state);
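
To see why the old check misfires, here is a minimal standalone sketch
(plain C, not kernel code; "struct task" and its fields below are
simplified stand-ins for the relevant task_struct/runqueue state,
introduced purely for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the state psi_sched_switch() used to inspect. */
struct task {
	bool on_rq_queued;	/* task_on_rq_queued(): still on the runqueue? */
	bool sched_delayed;	/* dequeue deferred by DELAY_DEQUEUE */
};

/* Old heuristic: infer "blocked" from runqueue state alone. */
static bool sleeping_old(const struct task *prev)
{
	return !prev->on_rq_queued;
}

int main(void)
{
	/*
	 * A task that blocks under DELAY_DEQUEUE: block_task() marks it
	 * delayed but leaves it queued on the runqueue.
	 */
	struct task prev = { .on_rq_queued = true, .sched_delayed = true };

	/*
	 * The old check reports 0 ("still runnable"), so PSI would never
	 * clear TSK_RUNNING for this task -- the inconsistent-state splat.
	 */
	printf("old heuristic says blocked: %d\n", sleeping_old(&prev));

	/*
	 * The fix records the decision at the point __schedule() calls
	 * block_task(), independent of runqueue state.
	 */
	bool block = true;	/* set right after block_task(rq, prev, flags) */
	printf("explicit flag says blocked: %d\n", block);

	return 0;
}

With DELAY_DEQUEUE the two answers diverge for a blocking task that is
left queued, which is exactly the window in which the old call site
passed the wrong value to psi_sched_switch().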