On Fri, 07 Dec 2012 18:56:15 -0500
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> I've been debugging large latencies on a 40 core box and found a major
> cause due to the thundering herd like grab of the rq lock due to the
> pull_rt_task() logic.
>
> Basically, if a large number of CPUs were to lower its priority roughly
> the same time, they would all trigger a pull. If there happens to be
> only one CPU available to get a task, all CPUs doing the pull will try
> to grab it. In doing so, they will all contend on the rq lock of
> the overloaded CPU. Only one CPU will succeed in pulling the task
> and unfortunately, there's no quick way to know which, as it's dependent
> on the affinity of the task that needs to be pulled, and to look at that,
> we need to grab its rq lock!
>
> Instead of having the pull logic grab the rq locks and do the work to
> switch the task over to the pulling CPU, this patch series (well patch
> #3) has the pulling CPU send an IPI to the overloaded CPU and that
> CPU will do the push instead. The push logic uses the cpupri.c code
> to quickly find the best CPU to offload the overloaded RT task to, so
> it makes it quite efficient to do this.
>
> Retrieving multiple IPIs has a much lower overhead than all the CPUs
> grabbing the rq lock.
>
> The other three patches are fixes/enhancements to the push/pull code
> that I found while doing the debugging of the latencies.
>
> Note, although this patch series is made for the -rt patch, the issues
> apply to mainline as well. But because -rt has the migrate_disable() code,
> this patch series is tailored to that. But if we can vet this out in
> -rt, all this code should make its way quickly to mainline.
>
> I tested this code out, but it probably needs some clean up and definitely
> more comments. I'm only posting this as an RFC for now to get feedback
> on the idea.
>
> Thanks!
>

Steve,

I've been running this set of patches on my laptop+RT kernel since
Friday with no ill effects. I just applied it to v3.6.10+rt21 and it
seems to be fine.

I've got rteval runs going on a 40-core and a 24-core box, which will be
done early Tuesday morning, so I'll let you know the results then.

Clark