On Thu, 13 Dec 2012, Steven Rostedt wrote:

> I didn't get a chance to test the latest IPI patch series on the 40 core
> box, and only had my 4 way box to test on. But I was able to test it
> last night and found some issues.
>
> The RT_PUSH_IPI feature doesn't get automatically set because just doing
> the sched_feat_enable() wasn't enough. Below is the corrected patch.
>
> Also, for some reason patch 3 caused the box to hang. Perhaps it
> required RT_PUSH_IPI to be set, because it worked with the original patch
> series, but that series only did the push IPI. I removed patch 3 on the
> 40 core box before noticing that RT_PUSH_IPI wasn't being automatically
> enabled.
>
> Here's an update of patch 4:
>
> sched/rt: Use IPI to trigger RT task push migration instead of pulling
>
> When debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found there was huge contention on the
> runqueue locks.
>
> Investigating it further, running ftrace, I found that it was due to
> the pulling of RT tasks.
>
> The test that was run was the following:
>
>  cyclictest --numa -p95 -m -d0 -i100
>
> This created a thread on each CPU, each of which would set its wakeup in
> iterations of 100 microseconds. The -d0 means that all the threads had
> the same interval (100us). Each thread sleeps for 100us, then wakes up
> and measures its latency.
>
> What happened was that another RT task would be scheduled on one of the
> CPUs running our test; when the test threads on the other CPUs went to
> sleep and those CPUs scheduled idle, this caused the "pull" operation to
> execute on all of them. Each of these CPUs saw the RT task queued on the
> overloaded CPU where the test was still running, and each one tried to
> grab that task in a thundering herd way.
>
> To grab the task, each CPU would do a double rq lock grab, taking its
> own rq lock as well as the rq lock of the overloaded CPU. As the sched
> domains on this box were rather flat for its size, I saw up to 12 CPUs
> block on this lock at once. This caused a ripple effect with the rq
> locks. While these locks were contended, any wakeups or load balancing
> on these CPUs would also block on them, and the wait time escalated.
>
> I've tried various methods to lessen the load, but things like an
> atomic counter to only let one CPU grab the task won't work, because
> the task may have a limited affinity, and we may pick the wrong
> CPU to take that lock and do the pull, only to find out that the
> CPU we picked isn't in the task's affinity.
>
> Instead of doing the pull, I now have the CPUs that want the pull send
> an IPI to the overloaded CPU, and let that CPU pick which CPU to push
> the task to. There is no longer any need to grab the remote rq lock,
> and the push/pull algorithm still works fine.
>
> With this patch, the latency dropped to just 150us over a 20 hour run.
> Without the patch, the huge latencies would trigger within seconds.
>
> Now, this issue only seems to apply to boxes with more than 16 CPUs.
> We noticed this on a 24 CPU box, and things got much worse on 40 (and
> presumably it would get even worse with more CPUs). But running with 16
> CPUs or fewer, the lock contention caused by the pulling of RT tasks
> is not noticeable.
>
> I've created a new sched feature called RT_PUSH_IPI, which is disabled
> by default on machines with 16 or fewer CPUs and enabled on machines
> with 17 or more. That seems to be the heuristic limit where the pulling
> logic causes higher latencies than IPIs. Of course, as with all
> heuristics, things could be different on different architectures.
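
To make the pile-up described above concrete, here is a rough userspace model
of the old pull path (a sketch only, not the kernel code; every name in it is
invented for the illustration). A dozen threads stand in for CPUs that lower
their priority at the same time, and each one does the double lock grab: its
own lock plus the overloaded CPU's lock, so all of them end up serialized on
that single victim lock.

/*
 * Toy model of the "thundering herd" pull described in the commit message.
 * NOT the kernel code; all names here are invented for the illustration.
 * Each waiter thread stands in for a CPU that has just lowered its priority
 * and runs the old pull path: take my own "rq" lock, then the overloaded
 * CPU's "rq" lock, so every waiter serializes on the same victim lock.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_WAITERS   12          /* CPUs that lower their priority at once */
#define OVERLOADED   0           /* index of the RT-overloaded "CPU"       */

static pthread_mutex_t rq_lock[NR_WAITERS + 1];  /* one "rq" lock per CPU */

static void *old_pull_path(void *arg)
{
        long cpu = (long)arg;

        /* Old behavior: double lock - my own rq lock plus the victim's. */
        pthread_mutex_lock(&rq_lock[cpu]);
        pthread_mutex_lock(&rq_lock[OVERLOADED]);

        /* ... this is where the pull would migrate the waiting RT task ... */
        printf("cpu %ld: got the overloaded CPU's rq lock\n", cpu);

        pthread_mutex_unlock(&rq_lock[OVERLOADED]);
        pthread_mutex_unlock(&rq_lock[cpu]);
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_WAITERS];
        long i;

        for (i = 0; i <= NR_WAITERS; i++)
                pthread_mutex_init(&rq_lock[i], NULL);

        /* All "CPUs" lower their priority at once and try to pull. */
        for (i = 1; i <= NR_WAITERS; i++)
                pthread_create(&tid[i - 1], NULL, old_pull_path, (void *)i);
        for (i = 1; i <= NR_WAITERS; i++)
                pthread_join(tid[i - 1], NULL);
        return 0;
}

Built with something like "gcc -pthread", all twelve threads queue up on
rq_lock[OVERLOADED]; that is roughly the shape of the 12-CPU pile-up described
above, where the kernel's equivalent of the victim lock is taken through
double_lock_balance() in pull_rt_task().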
>
> When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
> and having the pulling CPU do the work is used. When RT_PUSH_IPI is
> enabled, an IPI is sent to the overloaded CPU to have it do the push.
>
> To enable or disable this at run time:
>
>  # mount -t debugfs nodev /sys/kernel/debug
>  # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
> or
>  # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
>
> Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
>
> Index: rt-linux.git/kernel/sched/core.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/core.c
> +++ rt-linux.git/kernel/sched/core.c
> @@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
>
>  void scheduler_ipi(void)
>  {
> +        if (sched_feat(RT_PUSH_IPI))
> +                sched_rt_push_check();
> +
>          if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
>                  return;
>
> @@ -7541,6 +7544,21 @@ void __init sched_init_smp(void)
>          free_cpumask_var(non_isolated_cpus);
>
>          init_sched_rt_class();
> +
> +        /*
> +         * To avoid heavy contention on large CPU boxes,
> +         * when there is an RT overloaded CPU (two or more RT tasks
> +         * queued to run on a CPU and one of the waiting RT tasks
> +         * can migrate) and another CPU lowers its priority, instead
> +         * of grabbing both rq locks of the CPUs (as many CPUs lowering
> +         * their priority at the same time may create large latencies)
> +         * send an IPI to the CPU that is overloaded so that it can
> +         * do an efficient push.
> +         */
> +        if (num_possible_cpus() > 16) {
> +                sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
> +                sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
> +        }
>  }
>  #else
>  void __init sched_init_smp(void)
> Index: rt-linux.git/kernel/sched/rt.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/rt.c
> +++ rt-linux.git/kernel/sched/rt.c
> @@ -1723,6 +1723,31 @@ static void push_rt_tasks(struct rq *rq)
>                  ;
>  }
>
> +/**
> + * sched_rt_push_check - check if we can push waiting RT tasks
> + *
> + * Called from the sched IPI when the sched feature RT_PUSH_IPI is enabled.
> + *
> + * Checks if there is an RT task that can migrate and there exists
> + * a CPU in its affinity that only has tasks lower in priority than
> + * the waiting RT task. If so, then it will push the task off to that
> + * CPU.
> + */
> +void sched_rt_push_check(void)
> +{
> +        struct rq *rq = cpu_rq(smp_processor_id());
> +
> +        if (WARN_ON_ONCE(!irqs_disabled()))
> +                return;
> +
> +        if (!has_pushable_tasks(rq))
> +                return;
> +
> +        raw_spin_lock(&rq->lock);
> +        push_rt_tasks(rq);
> +        raw_spin_unlock(&rq->lock);
> +}
> +
>  static int pull_rt_task(struct rq *this_rq)
>  {
>          int this_cpu = this_rq->cpu, ret = 0, cpu;
> @@ -1750,6 +1775,18 @@ static int pull_rt_task(struct rq *this_
>                          continue;
>
>                  /*
> +                 * When the RT_PUSH_IPI sched feature is enabled, instead
> +                 * of trying to grab the rq lock of the RT overloaded CPU,
> +                 * send an IPI to that CPU instead. This prevents heavy
> +                 * contention from several CPUs lowering their priority
> +                 * and all trying to grab the rq lock of that overloaded CPU.
> +                 */
> +                if (sched_feat(RT_PUSH_IPI)) {
> +                        smp_send_reschedule(cpu);
> +                        continue;
> +                }
> +
> +                /*
>                   * We can potentially drop this_rq's lock in
>                   * double_lock_balance, and another CPU could
>                   * alter this_rq
> Index: rt-linux.git/kernel/sched/sched.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/sched.h
> +++ rt-linux.git/kernel/sched/sched.h
> @@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
>          __release(rq2->lock);
>  }
>
> +void sched_rt_push_check(void);
> +
>  #else /* CONFIG_SMP */
>
>  /*
> @@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
>          __release(rq2->lock);
>  }
>
> +static inline void sched_rt_push_check(void)
> +{
> +}
>  #endif
>
>  extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
> Index: rt-linux.git/kernel/sched/features.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/features.h
> +++ rt-linux.git/kernel/sched/features.h
> @@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
>  # endif
>  #endif
>
> +/*
> + * In order to avoid a thundering herd attack of CPUs that are
> + * lowering their priorities at the same time, and there being
> + * a single CPU that has an RT task that can migrate and is waiting
> + * to run, where the other CPUs will try to take that CPU's
> + * rq lock and possibly create large contention, sending an
> + * IPI to that CPU and letting that CPU push the RT task to where
> + * it should go may be a better scenario.
> + *
> + * This is default off for machines with <= 16 CPUs, and will
> + * be turned on at boot up for machines with > 16 CPUs.
> + */
> +SCHED_FEAT(RT_PUSH_IPI, false)
> +
>  SCHED_FEAT(FORCE_SD_OVERLAP, false)
>  SCHED_FEAT(RT_RUNTIME_SHARE, true)
>  SCHED_FEAT(LB_MIN, false)

FWIW: Applying this to our latest test queue.

Thanks

John
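
To round this out, here is the same toy model reworked along the lines of the
RT_PUSH_IPI approach in the patch above (again only a hedged userspace sketch
with invented names, not the kernel code). The lowering CPUs never touch the
overloaded CPU's lock: they just raise a flag and signal it, standing in for
smp_send_reschedule(), and the overloaded CPU pushes the task while holding
only its own lock, much as sched_rt_push_check() does.

/*
 * Toy model of the RT_PUSH_IPI approach. NOT the kernel code; the names
 * are invented for the illustration. Waiters no longer take the overloaded
 * CPU's lock: they set a flag and signal it (the stand-in for an IPI), and
 * the overloaded CPU does the push under only its own lock.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t overloaded_rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ipi_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ipi_cond = PTHREAD_COND_INITIALIZER;
static bool push_requested;

/* What a CPU lowering its priority does instead of a double rq lock grab. */
static void *send_push_ipi(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&ipi_lock);
        push_requested = true;                  /* "IPI" the overloaded CPU */
        pthread_cond_signal(&ipi_cond);
        pthread_mutex_unlock(&ipi_lock);
        return NULL;
}

/* The overloaded CPU's side, loosely modeled on sched_rt_push_check(). */
static void *overloaded_cpu(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&ipi_lock);
        while (!push_requested)
                pthread_cond_wait(&ipi_cond, &ipi_lock);
        pthread_mutex_unlock(&ipi_lock);

        /* Only this CPU's own rq lock is needed to push the task away. */
        pthread_mutex_lock(&overloaded_rq_lock);
        printf("overloaded cpu: pushing its waiting RT task itself\n");
        pthread_mutex_unlock(&overloaded_rq_lock);
        return NULL;
}

int main(void)
{
        pthread_t pusher, waiters[12];
        int i;

        pthread_create(&pusher, NULL, overloaded_cpu, NULL);
        for (i = 0; i < 12; i++)
                pthread_create(&waiters[i], NULL, send_push_ipi, NULL);
        for (i = 0; i < 12; i++)
                pthread_join(waiters[i], NULL);
        pthread_join(pusher, NULL);
        return 0;
}

Only one thread ever takes overloaded_rq_lock here, which is why the remote
rq lock contention disappears once RT_PUSH_IPI is enabled.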