On 4 January 2017 at 18:20, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote: > On 21/12/16 15:50, Vincent Guittot wrote: > > IMHO, the overall idea makes sense to me. Just a couple of small > questions ... > >> The update of the share of a cfs_rq is done when its load_avg is updated >> but before the group_entity's load_avg has been updated for the past time >> slot. This generates wrong load_avg accounting which can be significant >> when small tasks are involved in the scheduling. > > Why for small tasks? Is it because we use load = > scale_load_down(cfs_rq->load.weight) and not load = cfs_rq->avg.load_avg > in calc_cfs_shares()? small tasks are significantly impacted because the tick has less chance to fire while task is running and as a result less chance to update group entity's load_avg of the past tick with the correct weight and minimize the impact of this bug. > >> Let take the example of a task a that is dequeued of its task group A: >> root >> (cfs_rq) >> \ >> (se) >> A >> (cfs_rq) >> \ >> (se) >> a >> >> Task "a" was the only task in task group A which becomes idle when a is >> dequeued. >> >> We have the sequence: >> >> - dequeue_entity a->se >> - update_load_avg(a->se) >> - dequeue_entity_load_avg(A->cfs_rq, a->se) >> - update_cfs_shares(A->cfs_rq) >> A->cfs_rq->load.weight == 0 > > You mean A->cfs_rq->load.weight = 0 ? no A->cfs_rq->load.weight == 0. Just to point out that A->cfs_rq->load.weight equals 0 at that time. A->cfs_rq->load.weight has been set to 0 in account_entity_dequeue() which is not described here and happen between dequeue_entity_load_avg() and update_cfs_shares() > >> A->se->load.weight is updated with the new share (0 in this case) > > Shouldn't this be MIN_SHARES (2) instead of 0? yes you're right I had in mind scale_load_down(se->load.weight) which is used when calling __update_load_avg() in update_load_avg() and which is 0 on the 64bits platform that I used > >> - dequeue_entity A->se >> - update_load_avg(A->se) but its weight is now null so the last time >> slot (up to a tick) will be accounted with a weight of 0 instead of >> its real weight during the time slot. The last time slot will be >> accounted as an idle one whereas it was a running one. >> >> If the running time of task a is short enough that no tick happens when it >> runs, all running time of group entity A->se will be accounted as idle >> time. >> >> Instead, we should update the share of a cfs_rq (in fact the weight of its >> group entity) only after having updated the load_avg of the group_entity. > > This is because we use 'se->on_rq * scale_load_down(se->load.weight)' in > __update_load_avg() as weight parameter for PELT load_avg update? yes > >> update_cfs_shares() now take the sched_entity as parameter instead of the >> cfs_rq and the weight of the group_entity is updated only once its load_avg >> has been synced with current time. > > [...] > -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html