Hello Vincent, Thank you for taking a look at the series. On 3/6/2024 3:29 PM, Vincent Guittot wrote: > Hi Prateek, > > Adding Julia who could be interested in this patchset. Your patchset > should trigger idle load balance instead of newly idle load balance > now when the polling is used. This was one reason for not migrating > task in idle CPU Thank you. > > On Tue, 20 Feb 2024 at 18:15, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote: >> >> Hello everyone, >> >> [..snip..] >> >> >> Skipping newidle_balance() >> ========================== >> >> In an earlier attempt to solve the challenge of the long IRQ disabled >> section, newidle_balance() was skipped when a CPU waking up from idle >> was found to have no runnable tasks, and was transitioning back to >> idle [2]. Tim [3] and David [4] had pointed out that newidle_balance() >> may be viable for CPUs that are idling with tick enabled, where the >> newidle_balance() has the opportunity to pull tasks onto the idle CPU. >> >> Vincent [5] pointed out a case where the idle load kick will fail to >> run on an idle CPU since the IPI handler launching the ILB will check >> for need_resched(). In such cases, the idle CPU relies on >> newidle_balance() to pull tasks towards itself. > > Calling newidle_balance() instead of the normal idle load balance > prevents the CPU to pull tasks from other groups Thank you for the correction. > >> >> Using an alternate flag instead of NEED_RESCHED to indicate a pending >> IPI was suggested as the correct approach to solve this problem on the >> same thread. >> >> >> Proposed solution: TIF_NOTIFY_IPI >> ================================= >> >> Instead of reusing TIF_NEED_RESCHED bit to pull an TIF_POLLING CPU out >> of idle, TIF_NOTIFY_IPI is a newly introduced flag that >> call_function_single_prep_ipi() sets on a target TIF_POLLING CPU to >> indicate a pending IPI, which the idle CPU promises to process soon. >> >> On architectures that do not support the TIF_NOTIFY_IPI flag (this >> series only adds support for x86 and ARM processors for now), > > I'm surprised that you are mentioning ARM processors because they > don't use TIF_POLLING. Yup I just realised that after Linus Walleij pointed it out on the thread. > >> call_function_single_prep_ipi() will fallback to setting >> TIF_NEED_RESCHED bit to pull the TIF_POLLING CPU out of idle. >> >> Since the pending IPI handlers are processed before the call to >> schedule_idle() in do_idle(), schedule_idle() will only be called if the >> IPI handler have woken / migrated a new task on the idle CPU and has set >> TIF_NEED_RESCHED bit to indicate the same. This avoids running into the >> long IRQ disabled section in schedule_idle() unnecessarily, and any >> need_resched() check within a call function will accurately notify if a >> task is waiting for CPU time on the CPU handling the IPI. >> >> Following is the crude visualization of how the situation changes with >> the newly introduced TIF_NOTIFY_IPI flag: >> -- >> CPU0 CPU1 >> ==== ==== >> do_idle() { >> __current_set_polling(); >> ... >> monitor(addr); >> if (!need_resched_or_ipi()) >> mwait() { >> /* Waiting */ >> smp_call_function_single(CPU1, func, wait = 1) { ... >> ... ... >> set_nr_if_polling(CPU1) { ... >> /* Realizes CPU1 is polling */ ... >> try_cmpxchg(addr, ... >> &val, ... >> val | _TIF_NOTIFY_IPI); ... >> } /* Does not send an IPI */ ... >> ... } /* mwait exit due to write at addr */ >> csd_lock_wait() { ... >> /* Waiting */ preempt_fold_need_resched(); /* fold if NEED_RESCHED */ >> ... __current_clr_polling(); >> ... flush_smp_call_function_queue() { >> ... func(); /* Will set NEED_RESCHED if sched_ttwu_pending() */ >> } /* End of wait */ } >> } if (need_resched()) { >> schedule_idle(); >> smp_call_function_single(CPU1, func, wait = 1) { } >> ... ... /* IRQs remain enabled */ >> arch_send_call_function_single_ipi(CPU1); -----------> /* Processes the IPI */ >> -- >> >> Results >> ======= >> >> With the TIF_NOTIFY_IPI, the time taken to complete a fixed set of IPIs >> using ipistorm improves drastically. Following are the numbers from the >> same dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on, >> C2 disabled) running ipistorm between CPU8 and CPU16: >> >> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1 >> >> ================================================================== >> Test : ipistorm (modified) >> Units : Normalized runtime >> Interpretation: Lower is better >> Statistic : AMean >> ================================================================== >> kernel: time [pct imp] >> tip:sched/core 1.00 [0.00] >> tip:sched/core + revert 0.81 [19.36] >> tip:sched/core + TIF_NOTIFY_IPI 0.20 [80.99] >> >> Same experiment was repeated on an dual socket ARM server (2 x 64C) >> which too saw a significant improvement in the ipistorm performance: > > Could you share more details about this ARM server ? Could it be an Arm64 one ? > I was not expecting any change for arm/arm64 which are not using TIF_POLLING I looked at the lscpu output and it said It was an "aarch64" server with model name "Neoverse-N1". Let me go back and test it once again just to be sure I did not catch a one off behavior (Might be a while since I have limited access to this machine) I'll also add a debug WARN_ON_ONCE() to see if "TIF_NOTIF_IPI" is being set. > > >> >> ================================================================== >> Test : ipistorm (modified) >> Units : Normalized runtime >> Interpretation: Lower is better >> Statistic : AMean >> ================================================================== >> kernel: time [pct imp] >> tip:sched/core 1.00 [0.00] >> tip:sched/core + TIF_NOTIFY_IPI 0.41 [59.29] >> >> netperf and tbench results with the patch match the results on tip on >> the dual socket 3rd Generation AMD system (2 x 64C/128T). Additionally, >> hackbench, stream, and schbench too were tested, with results from the >> patched kernel matching that of the tip. >> >> >> Future Work >> =========== >> >> Evaluate impact of newidle_balance() when scheduler tick hits an idle >> CPU. The call to newidle_balance() will be skipped with the > > But it should call the normal idle load balance instead Yup, but the frequency of normal idle balance will be lower than the frequency at which a newidle balance is being triggered currently if tick is not disabled right? Please correct me if I'm wrong. > >> TIF_NOTIFY_IPI solution similar to [2]. Counter argument for the case is >> that if the idle state did not set the TIF_POLLING bit, the idle CPU >> would not have called schedule_idle() unless the IPI handler set the >> NEED_RESCHED bit. >> >> >> Links >> ===== >> >> [1] https://github.com/antonblanchard/ipistorm >> [2] https://lore.kernel.org/lkml/20240119084548.2788-1-kprateek.nayak@xxxxxxx/ >> [3] https://lore.kernel.org/lkml/b4f5ac150685456cf45a342e3bb1f28cdd557a53.camel@xxxxxxxxxxxxxxx/ >> [4] https://lore.kernel.org/lkml/20240123211756.GA221793@maniforge/ >> [5] https://lore.kernel.org/lkml/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@xxxxxxxxxxxxxx/ >> >> This series is based on tip:sched/core at tag "sched-core-2024-01-08". >> [..snip..] >> -- Thanks and Regards, Prateek