Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> · Tue, 30 Jul 2013 15:16:50 +0530

* Peter Zijlstra <peterz@xxxxxxxxxxxxx> [2013-07-30 11:10:21]:

> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> [2013-07-30 10:20:01]:
> > 
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > > This results in much improved performance. Again I would assume this work
> > > > > is complementary to Mel's work with numa faulting.
> > > > 
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > > 
> > > I also don't particularly like the fact that it's purely process based.
> > > The faults information we have gives much richer task relations.
> > > 
> > 
> > With just pure fault information based approach, I am not seeing any
> > major improvement in tasks/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes in each of the node.
> 
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.
> 

I am not against faults; fault-based handling is very much needed.
I have already said that this approach is complementary to the numa
faults work that Mel is proposing.

Right now I think if we can first get the tasks to consolidate on nodes
and then use the numa faults to place the tasks, then we would be able
to have a very good solution. 

Plain fault information is actually causing confusion in quite a few
cases, especially if the initial set of pages is all consolidated onto a
small set of nodes. With plain fault information, "memory follows cpu"
and "cpu follows memory" conflict with each other: memory wants to move
to the nodes where the tasks are currently running, and the tasks want
to move to the nodes where the memory currently is.
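
As a rough illustration of the conflict described above, here is a toy
model (not code from the patch series). It assumes a single task and its
memory, each sitting on one of two nodes, and two hypothetical policies
applied in rounds. All names (struct placement, step_both, and so on) are
made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical toy model: one task and its memory, each on one node. */
struct placement {
	int task_node;
	int mem_node;
};

/*
 * Apply both policies in the same round, each one looking at the state
 * from before the round: the task moves to where the memory was, and
 * the memory moves to where the task was.
 */
struct placement step_both(struct placement p)
{
	struct placement next;

	next.task_node = p.mem_node;	/* cpu follows memory */
	next.mem_node = p.task_node;	/* memory follows cpu */
	return next;
}

/* Apply only one policy: memory stays put and the task moves to it. */
struct placement step_task_only(struct placement p)
{
	p.task_node = p.mem_node;
	return p;
}

bool converged(struct placement p)
{
	return p.task_node == p.mem_node;
}

/* Rounds needed to converge, or -1 if still split after max_rounds. */
int rounds_to_converge(struct placement p, int max_rounds, bool both)
{
	for (int i = 0; i <= max_rounds; i++) {
		if (converged(p))
			return i;
		p = both ? step_both(p) : step_task_only(p);
	}
	return -1;
}
```

In this model, a task on node 0 with memory on node 1 keeps swapping
places with its memory when both policies act at once, while letting
only one side move converges after a single round.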

Also, most of the consolidation that I have proposed is either pretty
conservative or done at idle balance time. This would not affect
the numa faulting in any way. When I run with my patches (along with
some debug code), the consolidation happens pretty quickly.
Once consolidation has happened, numa faults would be of immense value.

Here is how I am looking at the solution.

1. Till the initial scan delay, allow tasks to consolidate.

2. From the first scan delay to the next scan delay, account numa
   faults and allow memory to move, but don't use numa faults as yet to
   drive scheduling decisions. Here also tasks continue to consolidate.

   This will lead to tasks and memory moving to specific nodes,
   leading to consolidation.

3. After the second scan delay, continue to account numa faults and
   allow numa faults to drive scheduling decisions.
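
The three stages above could be sketched roughly as follows. This is
only an illustration, not code from the patch series: the enum, the
struct and both functions are hypothetical names. It also assumes that
the second scan delay is as long as the first, that memory keeps moving
in stage 3, and that consolidation stays on in stage 3 (the mail leaves
that last point open).

```c
#include <stdbool.h>

/* Hypothetical stage names for the three steps listed above. */
enum numa_stage {
	STAGE_CONSOLIDATE_ONLY,		/* 1: before the first scan delay */
	STAGE_ACCOUNT_FAULTS,		/* 2: first to second scan delay */
	STAGE_FAULTS_DRIVE_SCHED,	/* 3: after the second scan delay */
};

/* What is allowed to happen in a given stage. */
struct stage_policy {
	bool consolidate_tasks;
	bool account_faults;
	bool migrate_memory;
	bool faults_drive_sched;
};

/*
 * Pick the stage from how long the task has run. Assumption: both scan
 * delays have the same length, scan_delay_ms.
 */
enum numa_stage numa_stage_for(unsigned long runtime_ms,
			       unsigned long scan_delay_ms)
{
	if (runtime_ms < scan_delay_ms)
		return STAGE_CONSOLIDATE_ONLY;
	/* Subtract instead of doubling the delay, to avoid overflow. */
	if (runtime_ms - scan_delay_ms < scan_delay_ms)
		return STAGE_ACCOUNT_FAULTS;
	return STAGE_FAULTS_DRIVE_SCHED;
}

struct stage_policy numa_policy_for(enum numa_stage stage)
{
	struct stage_policy p = {
		/* Assumption: consolidation stays on in every stage. */
		.consolidate_tasks = true,
		.account_faults = false,
		.migrate_memory = false,
		.faults_drive_sched = false,
	};

	if (stage >= STAGE_ACCOUNT_FAULTS) {
		p.account_faults = true;
		p.migrate_memory = true;
	}
	if (stage == STAGE_FAULTS_DRIVE_SCHED)
		p.faults_drive_sched = true;
	return p;
}
```

With a 1000 ms scan delay, for example, a task that has run for 1500 ms
would be in stage 2: faults are accounted and memory may move, but the
scheduler does not yet act on the fault data.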

Should we also use task weights at stage 3, or just numa faults, and
which one should get more preference? That is something I am not clear
about at this time. For now, I would think we would need to factor in
both of them.
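
One possible way to factor in both, purely as a sketch: blend the two
signals per node with a tunable bias. Nothing here comes from the patch
series. The per-node percentages, the fault_bias_pct knob and the
function names are all assumptions made for illustration, and the right
bias is exactly the open question above.

```c
/*
 * Hypothetical per-node inputs, both in percent (0..100):
 * weight_pct: share of the process's task weight running on this node
 * fault_pct:  share of the task's numa faults that hit this node
 */
struct node_stats {
	unsigned int weight_pct;
	unsigned int fault_pct;
};

/*
 * Blend the two signals. fault_bias_pct = 0 uses task weights only,
 * 100 uses numa faults only; values above 100 are clamped.
 */
unsigned int node_score(struct node_stats s, unsigned int fault_bias_pct)
{
	if (fault_bias_pct > 100)
		fault_bias_pct = 100;
	return (s.fault_pct * fault_bias_pct +
		s.weight_pct * (100 - fault_bias_pct)) / 100;
}

/* Return the index of the best-scoring node; ties go to the lowest index. */
int preferred_node(const struct node_stats *nodes, int nr_nodes,
		   unsigned int fault_bias_pct)
{
	int best = -1;
	unsigned int best_score = 0;

	for (int i = 0; i < nr_nodes; i++) {
		unsigned int score = node_score(nodes[i], fault_bias_pct);

		if (best < 0 || score > best_score) {
			best = i;
			best_score = score;
		}
	}
	return best;
}
```

With one node holding most of the task weight and another taking most
of the faults, a bias of 0 picks the first node and a bias of 100 picks
the second, so the knob makes the trade-off explicit.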

I think this approach would mean tasks get consolidated, while the
inter-process and inter-task relations that you are looking for also
remain strong.

Is this an acceptable solution?

-- 
Thanks and Regards
Srikar
