On Wed, May 14, 2014 at 11:08:35AM -0400, Paul Gortmaker wrote: > As of the old commit ac086bc22997a2be24fc40fc8d46522fe7e03d11 > ("sched: rt-group: smp balancing") the concept of borrowing per > cpu rt_runtime from one core to another was introduced. > > However, this prevents the RT throttling message from ever being > emitted when someone does a common (but mistaken) attempt at > using too much CPU in RT context. Consider the following test: > > echo "main() {for(;;);}" > full_load.c > gcc full_load.c -o full_load > taskset -c 1 ./full_load & > chrt -r -p 80 `pidof full_load` > > When run on x86_64 defconfig, what happens is as follows: > > -task runs on core1 for 95% of an rt_period as documented in > the file Documentation/scheduler/sched-rt-group.txt > > -at 95%, the code in balance_runtime sees this threshold and > calls do_balance_runtime() > > -do_balance_runtime sees that core 1 is in need, and does this: > --------------- > if (rt_rq->rt_runtime + diff > rt_period) > diff = rt_period - rt_rq->rt_runtime; > iter->rt_runtime -= diff; > rt_rq->rt_runtime += diff; > --------------- > which extends core1's rt_runtime by 5%, making it 100% of rt_period > by stealing 5% from core0 (or possibly some other core). > > However, the next time core1's rt_rq enters sched_rt_runtime_exceeded(), > we hit this near the top of that function: > --------------- > if (runtime >= sched_rt_period(rt_rq)) > return 0; > --------------- > and hence we'll _never_ look at/set any of the throttling checks and > messages in sched_rt_runtime_exceeded(). Instead, we will happily > plod along for CONFIG_RCU_CPU_STALL_TIMEOUT seconds, at which point > the RCU subsystem will get angry and trigger an NMI in response to > what it rightly sees as a WTF situation. In theory, one way of making RCU OK with an RT usermode CPU hog is to build with Frederic's CONFIG_NO_HZ_FULL=y. This will cause RCU to see CPUs having a single runnable usermode task as idle, preventing the RCU CPU stall warning. This does work well for mainline kernel in the lab. In practice, not sure how much testing CONFIG_NO_HZ_FULL=y has received for -rt kernels in production environments. But leaving practice aside for the moment... > Granted, there are lots of ways you can do bad things to yourself with > RT, but in the current zeitgeist of multicore systems with people > dedicating individual cores to individual tasks, I'd say the above is > common enough that we should react to it sensibly, and an RCU stall > really doesn't translate well to an end user vs a simple message that > says "throttling activated". > > One way to get the throttle message instead of the ambiguous and lengthy > NMI triggered all core backtrace of the RCU stall is to change the > SCHED_FEAT(RT_RUNTIME_SHARE, true) to false. One could make a good > case for this being the default for the out-of-tree preempt-rt series, > since folks using that are more apt to be manually tuning the system > and won't want an invisible hand coming in and making changes. > > However, in mainline, where it is more likely that there will be > n+x (x>0) RT tasks on an n core system, we can leave the sharing on, > and still avoid the RCU stalls by noting that there is no point in > trying to balance when there are no tasks to migrate, or only a > single RT task is present. Inflating the rt_runtime does nothing > in this case other than defeat sched_rt_runtime_exceeded(). > > Cc: Ingo Molnar <mingo@xxxxxxxxxx> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> > Signed-off-by: Paul Gortmaker <paul.gortmaker@xxxxxxxxxxxxx> > --- > > [I'd mentioned a similar use case here: https://lkml.org/lkml/2013/3/6/338 > and tglx asked why they wouldn't see the throttle message; it is only > now that I had a chance to dig in and figure out why. Oh, and the patch > is against linux-next, in case that matters...] > > kernel/sched/rt.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > index ea4d500..698aac9 100644 > --- a/kernel/sched/rt.c > +++ b/kernel/sched/rt.c > @@ -774,6 +774,15 @@ static int balance_runtime(struct rt_rq *rt_rq) > if (!sched_feat(RT_RUNTIME_SHARE)) > return more; > > + /* > + * Stealing from another core won't help us at all if > + * we have nothing to migrate over there, or only one > + * task that is running up all the rt_time. In fact it > + * will just inhibit the throttling message in that case. > + */ > + if (!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1) How about something like the following to take NO_HZ_FULL into account? + if ((!rt_rq->rt_nr_migratory || rt_rq->rt_nr_total == 1) && + !tick_nohz_full_cpu(cpu)) Thanx, Paul > + return more; > + > if (rt_rq->rt_time > rt_rq->rt_runtime) { > raw_spin_unlock(&rt_rq->rt_runtime_lock); > more = do_balance_runtime(rt_rq); > -- > 1.8.2.3 > -- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html