Re: [PATCH 6.12 440/466] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> · Fri, 13 Dec 2024 15:27:25 +0100

On Thu, Dec 12, 2024 at 11:22:25PM +0530, K Prateek Nayak wrote:
> Hello Greg, Sasha,
> 
> On 12/12/2024 8:30 PM, Greg Kroah-Hartman wrote:
> > 6.12-stable review patch.  If anyone has any objections, please let me know.
> > 
> > ------------------
> > 
> > From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> > 
> > [ Upstream commit ea9cffc0a154124821531991d5afdd7e8b20d7aa ]
> > 
> > The need_resched() check currently in nohz_csd_func() can be tracked
> > to have been added in scheduler_ipi() back in 2011 via commit
> > ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
> > 
> > Since then, it has travelled quite a bit but it seems like an idle_cpu()
> > check currently is sufficient to detect the need to bail out from an
> > idle load balancing. To justify this removal, consider all the following
> > case where an idle load balancing could race with a task wakeup:
> > 
> > o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
> >    on wakelist if wakee cpu is idle") a target perceived to be idle
> >    (target_rq->nr_running == 0) will return true for
> >    ttwu_queue_cond(target) which will offload the task wakeup to the idle
> >    target via an IPI.
> > 
> >    In all such cases target_rq->ttwu_pending will be set to 1 before
> >    queuing the wake function.
> > 
> >    If an idle load balance races here, following scenarios are possible:
> > 
> >    - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
> >      IPI is sent to the CPU to wake it out of idle. If the
> >      nohz_csd_func() queues before sched_ttwu_pending(), the idle load
> >      balance will bail out since idle_cpu(target) returns 0 since
> >      target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
> >      sched_ttwu_pending() it should see rq->nr_running to be non-zero and
> >      bail out of idle load balancing.
> > 
> >    - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
> >      the sender will simply set TIF_NEED_RESCHED for the target to put it
> >      out of idle and flush_smp_call_function_queue() in do_idle() will
> >      execute the call function. Depending on the ordering of the queuing
> >      of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
> >      nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
> >      target_rq->nr_running to be non-zero if there is a genuine task
> >      wakeup racing with the idle load balance kick.
> > 
> > o The waker CPU perceives the target CPU to be busy
> >    (targer_rq->nr_running != 0) but the CPU is in fact going idle and due
> >    to a series of unfortunate events, the system reaches a case where the
> >    waker CPU decides to perform the wakeup by itself in ttwu_queue() on
> >    the target CPU but target is concurrently selected for idle load
> >    balance (XXX: Can this happen? I'm not sure, but we'll consider the
> >    mother of all coincidences to estimate the worst case scenario).
> > 
> >    ttwu_do_activate() calls enqueue_task() which would increment
> >    "rq->nr_running" post which it calls wakeup_preempt() which is
> >    responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
> >    setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
> >    thing to note in this case is that rq->nr_running is already non-zero
> >    in case of a wakeup before TIF_NEED_RESCHED is set which would
> >    lead to idle_cpu() check returning false.
> > 
> > In all cases, it seems that need_resched() check is unnecessary when
> > checking for idle_cpu() first since an impending wakeup racing with idle
> > load balancer will either set the "rq->ttwu_pending" or indicate a newly
> > woken task via "rq->nr_running".
> > 
> > Chasing the reason why this check might have existed in the first place,
> > I came across  Peter's suggestion on the fist iteration of Suresh's
> > patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
> > 
> > 	sched_ttwu_do_pending(list);
> > 
> > 	if (unlikely((rq->idle == current) &&
> > 	    rq->nohz_balance_kick &&
> > 	    !need_resched()))
> > 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> > 
> > Since the condition to raise the SCHED_SOFIRQ was preceded by
> > sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
> > the current upstream kernel, the need_resched() check was necessary to
> > catch a newly queued task. Peter suggested modifying it to:
> > 
> > 	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
> > 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> > 
> > where idle_cpu() seems to have replaced "rq->idle == current" check.
> > 
> > Even back then, the idle_cpu() check would have been sufficient to catch
> > a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize
> > send_call_function_single_ipi()") overloads the interpretation of
> > TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
> > need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
> > on Peter's suggestion.
> 
> Since v6.12 added support for PREEMPT_RT, you'll see the following
> warning being triggered when booting with PREEMPT_RT enabled on
> 6.12.5-rc1:
> 
>     ------------[ cut here ]------------
>     WARNING: CPU: 40 PID: 0 at kernel/softirq.c:292 do_softirq_post_smp_call_flush+0x1a/0x40
>     Modules linked in:
>     CPU: 40 UID: 0 PID: 0 Comm: swapper/40 Not tainted 6.12.5-rc1-test+ #220
>     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
>     RIP: 0010:do_softirq_post_smp_call_flush+0x1a/0x40
>     Code: ...
>     RSP: 0018:ffffad4c405cfeb8 EFLAGS: 00010002
>     RAX: 0000000000000080 RBX: 0000000000000282 RCX: 0000000000000007
>     RDX: 0000000000000000 RSI: 0000000000000083 RDI: 0000000000000000
>     RBP: 0000000000000000 R08: ffff942efc626080 R09: 0000000000000001
>     R10: 7fffffffffffffff R11: ffffffffffd2d2da R12: 0000000000000000
>     R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>     FS:  0000000000000000(0000) GS:ffff942efc600000(0000) knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: 0000000000000000 CR3: 000000807da48001 CR4: 0000000000f70ef0
>     PKRU: 55555554
>     Call Trace:
>      <TASK>
>      ? __warn+0x88/0x130
>      ? do_softirq_post_smp_call_flush+0x1a/0x40
>      ? report_bug+0x18e/0x1a0
>      ? handle_bug+0x5b/0xa0
>      ? exc_invalid_op+0x18/0x70
>      ? asm_exc_invalid_op+0x1a/0x20
>      ? do_softirq_post_smp_call_flush+0x1a/0x40
>      ? srso_alias_return_thunk+0x5/0xfbef5
>      flush_smp_call_function_queue+0x65/0x80
>      do_idle+0x149/0x260
>      cpu_startup_entry+0x29/0x30
>      start_secondary+0x12d/0x160
>      common_startup_64+0x13e/0x141
>      </TASK>
>     ---[ end trace 0000000000000000 ]---
> 
> Could you please also include upstream commit 6675ce20046d ("softirq:
> Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel") to the
> 6.12 stable queue to prevent this splat for PREEMPT_RT users.
> 
> Full upstream commit SHA1: 6675ce20046d149e1e1ffe7e9577947dee17aad5
> 
> The commit can be cleanly cherry-picked on top of v6.12.5-rc1 and I can
> confirm that it fixes the splat.

Thanks, I've now queued that up.  You all should have put a "Fixes:" tag
on it so that I would have noticed it automatically...

thanks,

greg k-h