On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:

> The update of the share of a cfs_rq is done when its load_avg is updated
> but before the group_entity's load_avg has been updated for the past time
> slot. This generates wrong load_avg accounting, which can be significant
> when small tasks are involved in the scheduling.
>
> Let's take the example of a task TA that is dequeued from its task group
> TG1. TA was the only task in TG1, which becomes idle.
>
> We have the sequence:
>
> - dequeue_entity TA->se
>     - update_load_avg(TA->se)
>     - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
>     - account_entity_dequeue(TG1->cfs_rq, TA->se)
>           TG1->cfs_rq->load.weight = 0
>     - update_cfs_shares(TG1->cfs_rq)
>           TG1->se->load.weight is updated with the new share of the
>           cfs_rq: TG1->se->load.weight = 0.
> - dequeue_entity TG1->se
>     - update_load_avg(TG1->se), but its weight is now zero, so the last
>       time slot (up to a tick) will be accounted with its new weight
>       (0 in our case) instead of its real weight. The last time slot is
>       accounted as an idle one whereas it was a running one.
>
> If the running time of TA is short enough that no tick happens while it
> runs, all of TG1->se's running time will be accounted as idle time.
>
> Instead, we should update the share of a cfs_rq (in fact the weight of
> its group entity) only after having updated the load_avg of the
> group_entity.
>
> update_cfs_shares() now takes the sched_entity as parameter instead of
> the cfs_rq, and the weight of the group_entity is updated only once its
> load_avg has been synced with the current time.

Urgh, brain hurts; also, those names don't help: s/TG1/A/ s/TA/a/

So the problem is that in our for_each_sched_entity(se) loop we end up
changing the next se before we get there.

        root (cfs_rq)
            \
             (se) A (cfs_rq)
                      \
                       (se) a

Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
updates A's se, which is the next se in our iteration and mucks with
state before we get there.

So you change update_cfs_shares() to go downward while we go upward,
ensuring we only update things that we've finished with.

Makes sense..

>  kernel/sched/fair.c | 27 ++++++++++++++++-----------
>  1 file changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18d9e75..19092fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>
> -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +static void update_cfs_shares(struct sched_entity *se)
>  {
>          struct task_group *tg;
> -        struct sched_entity *se;
> +        struct cfs_rq *cfs_rq = group_cfs_rq(se);
>          long shares;

Please keep them ordered by length.

>
> +        if (entity_is_task(se))

This can be: !cfs_rq, which is the same test, and we've already done
that load.

> +                return;
> +
>          tg = cfs_rq->tg;

This load isn't needed here yet; it can be moved down a bit.
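
Folding those three nits together, something like so perhaps?
(Completely untested, and assuming the calc_cfs_shares() +
reweight_entity() tail of the function stays as it is.)

        static void update_cfs_shares(struct sched_entity *se)
        {
                struct cfs_rq *cfs_rq = group_cfs_rq(se);
                struct task_group *tg;
                long shares;

                /* entity_is_task(); reuses the group_cfs_rq() load from above */
                if (!cfs_rq)
                        return;

                if (throttled_hierarchy(cfs_rq))
                        return;

                /* tg only needed from here on */
                tg = cfs_rq->tg;
        #ifndef CONFIG_SMP
                if (likely(se->load.weight == tg->shares))
                        return;
        #endif
                shares = calc_cfs_shares(cfs_rq, tg);

                reweight_entity(cfs_rq_of(se), se, shares);
        }
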
> -        se = tg->se[cpu_of(rq_of(cfs_rq))];
> -        if (!se || throttled_hierarchy(cfs_rq))
> +
> +        if (throttled_hierarchy(cfs_rq))
>                  return;
>  #ifndef CONFIG_SMP
>          if (likely(se->load.weight == tg->shares))
> @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>                  se->vruntime += cfs_rq->min_vruntime;
>
>          update_load_avg(se, UPDATE_TG);
> +        update_cfs_shares(se);
>          enqueue_entity_load_avg(cfs_rq, se);
>          account_entity_enqueue(cfs_rq, se);
> -        update_cfs_shares(cfs_rq);
>
>          if (flags & ENQUEUE_WAKEUP)
>                  place_entity(cfs_rq, se, 0);

So here we need to do update_cfs_shares() _before_
enqueue_entity_load_avg(), because update_cfs_shares() will affect this
se's load, right?

> @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>                  /* return excess runtime on last dequeue */
>                  return_cfs_rq_runtime(cfs_rq);
>
> -        update_cfs_shares(cfs_rq);
> +        update_cfs_shares(se);
>
>          /*
>           * Now advance min_vruntime if @se was the entity holding it back,

But this one hurts my brain.. It must be done after
dequeue_entity_load_avg() such that we subtract the load as it was seen
until now.

Could we please add comments explaining this ordering, because I
forever need to think about this (both enqueue and dequeue)?

> @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>           * Ensure that runnable average is periodically updated.
>           */
>          update_load_avg(curr, UPDATE_TG);
> -        update_cfs_shares(cfs_rq);
> +        update_cfs_shares(curr);
>
>  #ifdef CONFIG_SCHED_HRTICK
>          /*
> @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                          break;
>
>                  update_load_avg(se, UPDATE_TG);
> -                update_cfs_shares(cfs_rq);
> +                update_cfs_shares(se);
>          }
>
>          if (!se)
> @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                          break;
>
>                  update_load_avg(se, UPDATE_TG);
> -                update_cfs_shares(cfs_rq);
> +                update_cfs_shares(se);
>          }
>
>          if (!se)

This has a distinct pattern to it though; should we think about
something like an UPDATE_SHARES flag for update_load_avg(), or does
that confuse things?

> @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>                  /* Possible calls to update_curr() need rq clock */
>                  update_rq_clock(rq);
>                  for_each_sched_entity(se)
> -                        update_cfs_shares(group_cfs_rq(se));
> +                        update_cfs_shares(se);

Should we not also catch up with our load before we frob the shares?

>                  raw_spin_unlock_irqrestore(&rq->lock, flags);
>          }
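
For that last one, I mean something like so (again completely
untested), mirroring the pattern from the enqueue/dequeue paths above:

        for_each_sched_entity(se) {
                /* sync the entity's load_avg up to now ... */
                update_load_avg(se, UPDATE_TG);
                /* ... before computing the new shares from it */
                update_cfs_shares(se);
        }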