Re: [PATCH 6.12 440/466] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Greg, Sasha,

On 12/12/2024 8:30 PM, Greg Kroah-Hartman wrote:
6.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: K Prateek Nayak <kprateek.nayak@xxxxxxx>

[ Upstream commit ea9cffc0a154124821531991d5afdd7e8b20d7aa ]

The need_resched() check currently in nohz_csd_func() can be tracked
to have been added in scheduler_ipi() back in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")

Since then, it has travelled quite a bit but it seems like an idle_cpu()
check currently is sufficient to detect the need to bail out from an
idle load balancing. To justify this removal, consider all the following
case where an idle load balancing could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
   on wakelist if wakee cpu is idle") a target perceived to be idle
   (target_rq->nr_running == 0) will return true for
   ttwu_queue_cond(target) which will offload the task wakeup to the idle
   target via an IPI.

   In all such cases target_rq->ttwu_pending will be set to 1 before
   queuing the wake function.

   If an idle load balance races here, following scenarios are possible:

   - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
     IPI is sent to the CPU to wake it out of idle. If the
     nohz_csd_func() queues before sched_ttwu_pending(), the idle load
     balance will bail out since idle_cpu(target) returns 0 since
     target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
     sched_ttwu_pending() it should see rq->nr_running to be non-zero and
     bail out of idle load balancing.

   - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
     the sender will simply set TIF_NEED_RESCHED for the target to put it
     out of idle and flush_smp_call_function_queue() in do_idle() will
     execute the call function. Depending on the ordering of the queuing
     of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
     nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
     target_rq->nr_running to be non-zero if there is a genuine task
     wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
   (targer_rq->nr_running != 0) but the CPU is in fact going idle and due
   to a series of unfortunate events, the system reaches a case where the
   waker CPU decides to perform the wakeup by itself in ttwu_queue() on
   the target CPU but target is concurrently selected for idle load
   balance (XXX: Can this happen? I'm not sure, but we'll consider the
   mother of all coincidences to estimate the worst case scenario).

   ttwu_do_activate() calls enqueue_task() which would increment
   "rq->nr_running" post which it calls wakeup_preempt() which is
   responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
   setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
   thing to note in this case is that rq->nr_running is already non-zero
   in case of a wakeup before TIF_NEED_RESCHED is set which would
   lead to idle_cpu() check returning false.

In all cases, it seems that need_resched() check is unnecessary when
checking for idle_cpu() first since an impending wakeup racing with idle
load balancer will either set the "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".

Chasing the reason why this check might have existed in the first place,
I came across  Peter's suggestion on the fist iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

Since the condition to raise the SCHED_SOFIRQ was preceded by
sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
the current upstream kernel, the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced "rq->idle == current" check.

Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.

Since v6.12 added support for PREEMPT_RT, you'll see the following
warning being triggered when booting with PREEMPT_RT enabled on
6.12.5-rc1:

    ------------[ cut here ]------------
    WARNING: CPU: 40 PID: 0 at kernel/softirq.c:292 do_softirq_post_smp_call_flush+0x1a/0x40
    Modules linked in:
    CPU: 40 UID: 0 PID: 0 Comm: swapper/40 Not tainted 6.12.5-rc1-test+ #220
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:do_softirq_post_smp_call_flush+0x1a/0x40
    Code: ...
    RSP: 0018:ffffad4c405cfeb8 EFLAGS: 00010002
    RAX: 0000000000000080 RBX: 0000000000000282 RCX: 0000000000000007
    RDX: 0000000000000000 RSI: 0000000000000083 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffff942efc626080 R09: 0000000000000001
    R10: 7fffffffffffffff R11: ffffffffffd2d2da R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS:  0000000000000000(0000) GS:ffff942efc600000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000807da48001 CR4: 0000000000f70ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? __warn+0x88/0x130
     ? do_softirq_post_smp_call_flush+0x1a/0x40
     ? report_bug+0x18e/0x1a0
     ? handle_bug+0x5b/0xa0
     ? exc_invalid_op+0x18/0x70
     ? asm_exc_invalid_op+0x1a/0x20
     ? do_softirq_post_smp_call_flush+0x1a/0x40
     ? srso_alias_return_thunk+0x5/0xfbef5
     flush_smp_call_function_queue+0x65/0x80
     do_idle+0x149/0x260
     cpu_startup_entry+0x29/0x30
     start_secondary+0x12d/0x160
     common_startup_64+0x13e/0x141
     </TASK>
    ---[ end trace 0000000000000000 ]---

Could you please also include upstream commit 6675ce20046d ("softirq:
Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel") to the
6.12 stable queue to prevent this splat for PREEMPT_RT users.

Full upstream commit SHA1: 6675ce20046d149e1e1ffe7e9577947dee17aad5

The commit can be cleanly cherry-picked on top of v6.12.5-rc1 and I can
confirm that it fixes the splat.

--
Thanks and Regards,
Prateek


Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@xxxxxxx
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
  kernel/sched/core.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 76b27b2a9c56a..33bc43b223cba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1242,7 +1242,7 @@ static void nohz_csd_func(void *info)
  	WARN_ON(!(flags & NOHZ_KICK_MASK));
rq->idle_balance = idle_cpu(cpu);
-	if (rq->idle_balance && !need_resched()) {
+	if (rq->idle_balance) {
  		rq->nohz_idle_balance = flags;
  		raise_softirq_irqoff(SCHED_SOFTIRQ);
  	}





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux