Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Fri, 5 Jul 2013 12:16:54 +0200

On Thu, Jul 04, 2013 at 11:32:27PM +0530, Srikar Dronamraju wrote:
> Here is an approach to look at numa balanced scheduling from a non numa fault
> angle. This approach uses process weights instead of faults as a basis to
> move or bring tasks together.

That doesn't make any sense; how would weight be related to NUMA
placement?

What it appears to do is simply group tasks based on ->mm. And by
keeping them somewhat sticky to the same node it gets locality.

What about multi-process shared memory workloads? It's one of the things
I disliked about autonuma. It completely disregards the multi-process
scenario.

If you want to go without faults, you also won't migrate memory along,
and if you just happen to place your workload elsewhere you have no idea
where your memory is. If you have the faults, you might as well account
them to get a notion of where the memory is; it's nearly free at that
point anyway.
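
As a rough illustration of what accounting faults per node could look
like, here is a userspace C sketch. The struct, the MAX_NODES bound and
both function names are made up for this example; this is not the
kernel's NUMA balancing code.

```c
#define MAX_NODES 8	/* hypothetical upper bound for this sketch */

/* Hypothetical per-task record; not a kernel structure. */
struct numa_fault_stats {
	unsigned long faults[MAX_NODES];
};

/* Record that a fault touched @pages pages of memory residing on @node. */
static void account_fault(struct numa_fault_stats *s, int node,
			  unsigned long pages)
{
	if (node >= 0 && node < MAX_NODES)
		s->faults[node] += pages;
}

/* Return the node with the most recorded faults, or -1 if none. */
static int preferred_node(const struct numa_fault_stats *s)
{
	unsigned long max = 0;
	int best = -1;

	for (int n = 0; n < MAX_NODES; n++) {
		if (s->faults[n] > max) {
			max = s->faults[n];
			best = n;
		}
	}
	return best;
}
```

The per-fault cost is one bounds check and one add, which is the sense
in which the accounting is nearly free once the fault has been taken.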

Load spikes/fluctuations can easily lead to transient task movement to
keep balance. If these movements are indeed transient, you want to return
to where you came from; however, if they are not, you want the memory to
come to you.
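
The two cases can be sketched as a small decision function. This is
only an illustration: the enum, the function name, and the idea of
counting "periods away" against a threshold are all assumptions made
for the example, not something the patch or the kernel defines.

```c
/* Hypothetical outcomes for a task running away from its memory. */
enum numa_action {
	NUMA_STAY,		/* task already runs where its memory is */
	NUMA_RETURN_TASK,	/* movement looks transient: go back */
	NUMA_PULL_MEMORY,	/* movement looks permanent: migrate memory */
};

/*
 * @cur_node:     node the task currently runs on
 * @mem_node:     node holding most of its memory (e.g. from fault stats)
 * @periods_away: assumed count of balance periods spent off @mem_node
 * @threshold:    assumed cut-off separating transient from persistent
 */
static enum numa_action numa_decide(int cur_node, int mem_node,
				    unsigned int periods_away,
				    unsigned int threshold)
{
	if (cur_node == mem_node)
		return NUMA_STAY;
	if (periods_away < threshold)
		return NUMA_RETURN_TASK;
	return NUMA_PULL_MEMORY;
}
```

Either branch needs mem_node, which is exactly the information you do
not have without some form of fault accounting.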

> +static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	unsigned long task_load = 0;
> +	int curnode = cpu_to_node(cpu_of(rq));
> +#ifdef CONFIG_SCHED_AUTOGROUP
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq)];
> +	if (!se)
> +		return;
> +
> +	if (cfs_rq->load.weight) {
> +		task_load =  p->se.load.weight * se->load.weight;
> +		task_load /= cfs_rq->load.weight;
> +	} else {
> +		task_load = 0;
> +	}
> +#else
> +	task_load = p->se.load.weight;
> +#endif

This looks broken; didn't you want to use task_h_load() here? There's
nothing autogroup-specific about task_load. If anything, you want to
handle the full cgroup hierarchy, which I think reduces to task_h_load()
here.
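
For reference, the idea behind a hierarchical task load is to scale the
task's weight by its group's share at every level up to the root,
rather than by one level as in the quoted hunk. Below is a simplified
userspace model of that calculation. The struct group_rq type and both
function names are hypothetical, and the kernel's actual task_h_load()
differs in detail.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a group runqueue. */
struct group_rq {
	unsigned long load_weight;	/* sum of entity weights queued here */
	unsigned long se_weight;	/* this group's weight in its parent */
	const struct group_rq *parent;	/* NULL for the root */
};

/* Load this group contributes as seen from the root runqueue. */
static unsigned long group_h_load(const struct group_rq *g)
{
	if (!g->parent)
		return g->load_weight;
	if (!g->parent->load_weight)
		return 0;
	return group_h_load(g->parent) * g->se_weight /
	       g->parent->load_weight;
}

/* Task weight scaled by its group's hierarchical share. */
static unsigned long task_h_load_sketch(unsigned long task_weight,
					const struct group_rq *g)
{
	if (!g->load_weight)
		return 0;
	return task_weight * group_h_load(g) / g->load_weight;
}
```

With a root runqueue of weight 2048 holding one group entity of weight
1024, and two tasks of weight 1024 each inside that group, each task
comes out at 512, i.e. half of the group's share.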

> +	p->task_load = 0;
> +	if (!task_load)
> +		return;
> +
> +	if (p->mm && p->mm->numa_weights) {
> +		p->mm->numa_weights[curnode] += task_load;
> +		p->mm->numa_weights[nr_node_ids] += task_load;
> +	}
> +
> +	if (p->nr_cpus_allowed != num_online_cpus())
> +		rq->pinned_load += task_load;
> +	p->task_load = task_load;
> +}
> +

> @@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
>  		if (!balance)
>  			break;
>  	}
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!rq->nr_running) {

This would only work for underutilized systems...

> +	}
> +#endif
