Re: [PATCH 39/39] autonuma: NUMA scheduler SMT awareness

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Wed, 28 Mar 2012 15:51:40 +0200

On Tue, Mar 27, 2012 at 02:00:12AM +0200, Andrea Arcangeli wrote:
> On Mon, Mar 26, 2012 at 08:57:03PM +0200, Peter Zijlstra wrote:
> > On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> > > Add SMT awareness to the NUMA scheduler so that it will not move load
> > > from fully idle SMT threads, to semi idle SMT threads.
> > 
> > This shows a complete fail in design, you're working around the regular
> > scheduler/load-balancer instead of with it and hence are duplicating all
> > kinds of stuff.
> > 
> > I'll not have that..
> 
> I think here you're misunderstanding implementation issues with
> design.
> 
> I already mentioned the need of closer integration in CFS as point 4
> of my TODO list in the first email of this thread. The current

I pushed an autonuma-alpha11 branch where I dropped the SMT logic
entirely from the AutoNUMA scheduler, the part you NAKed. Not just
that, I dropped the idle balancing as well.

It seems slower to react, but its active idle balancing is smarter and
on average it's maxing out the memory channel bandwidth better now.

I hope I eliminated the code duplication. What remains in AutoNUMA is
the NUMA load active balancing, which CFS has zero clue about.

I ran a full regression test and it passed. Multi-instance stream
should now also run much faster when nr_process > 1 and nr_process <
nr_cpus/2.
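
To make that range concrete, here is a minimal sketch of the condition
as a C predicate. The helper name is hypothetical and exists only for
illustration; it is not part of AutoNUMA or the scheduler. It assumes
integer division for nr_cpus/2.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (illustration only, not AutoNUMA code):
 * returns true when the number of stream instances falls in the
 * range described above, i.e. more than one process but fewer than
 * half the CPUs. nr_cpus / 2 uses integer division, so with
 * 16 CPUs the range is 2..7 processes.
 */
static bool stream_in_improved_range(int nr_process, int nr_cpus)
{
	return nr_process > 1 && nr_process < nr_cpus / 2;
}
```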

About the need for closer integration with CFS, note also that your
kernel/sched/numa.c code was doing things like:

+       // XXX should be sched_domain aware
+       for_each_online_node(node) {

So I hope you will understand why I had to take a few shortcuts, but
over time I'm fully committed to integrating numa.c better wherever
possible, and especially to removing the call at every schedule so it
will scale fine to thousands of CPUs. It's just not trivial to do.

About your code, I have a hard time believing that driving the
scheduler from a static home node placement decided at exec/fork, as
your code does, could compete with the AutoNUMA math for workloads
with highly variable load and several threads and processes going idle
and loading the CPUs again. Real life unfortunately isn't as trivial
as a multi-instance stream. I believe you can handle multi-instance
streams OK, though.
