On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
>
> > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > +
> > +	/* Only latency sensitive entity can be added to the list */
> > +	if (se->latency_offset >= 0)
> > +		return;
> > +
> > +	if (!RB_EMPTY_NODE(&se->latency_node))
> > +		return;
> > +
> > +	/*
> > +	 * An execution time less than sysctl_sched_min_granularity means that
> > +	 * the entity has been preempted by a higher sched class or an entity
> > +	 * with higher latency constraint.
> > +	 * Put it back in the list so it gets a chance to run 1st during the
> > +	 * next slice.
> > +	 */
> > +	if (!(flags & ENQUEUE_WAKEUP)) {
> > +		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > +
> > +		if (delta_exec >= sysctl_sched_min_granularity)
> > +			return;
> > +	}
>
> I'm not a big fan of this dynamic enqueueing condition; it makes it
> rather hard to interpret the below addition to pick_next_entity().
>
> Let me think about this more... at the very least the comment with
> __pick_first_latency() use below needs to be expanded upon if we keep it
> like so.

Only waking tasks should be added to the latency rb tree, so they can
be selected to run first (as long as they don't use too much runtime).

But, as an example, task A can wake up, preempt the current task B
thanks to its latency nice, start to run for a few usecs and then be
immediately preempted by an RT task C. In this case, we consider that
task A didn't get a chance to run after its wakeup, and we put it back
in the latency rb tree just as if task A had just woken up but hadn't
preempted the new current task C.

I used sysctl_sched_min_granularity because it is the minimum runtime
a task gets before it can be preempted at tick by another cfs task
with a lower vruntime. So if it ran for less than
sysctl_sched_min_granularity, we are sure that it was preempted by
higher priority tasks, and not that it used up all its runtime
compared to the others.

> > +
> > +	rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, __latency_less);
> > +}
>
> > @@ -4966,7 +5040,7 @@ static struct sched_entity *
> >  pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  {
> >  	struct sched_entity *left = __pick_first_entity(cfs_rq);
> > -	struct sched_entity *se;
> > +	struct sched_entity *latency, *se;
> >
> >  	/*
> >  	 * If curr is set we have to see if its left of the leftmost entity
> > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  		se = cfs_rq->last;
> >  	}
> >
> > +	/* Check for latency sensitive entity waiting for running */
> > +	latency = __pick_first_latency(cfs_rq);
> > +	if (latency && (latency != se) &&
> > +	    wakeup_preempt_entity(latency, se) < 1)
> > +		se = latency;
>
> I'm not quite sure why this condition isn't sufficient on its own.
> After all, if a task does a 'spurious' nanosleep it can get around the
> 'restriction' in __enqueue_latency() without any great penalty to its
> actual bandwidth consumption.

Yes, until it has used all its runtime.
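
For readers following along: __latency_less() and __pick_first_latency()
are used in the hunks above but not quoted. A minimal sketch of what
they could look like, mirroring the __pick_first_entity() pattern; the
ordering key (vruntime shifted by the negative latency_offset) is an
assumption here, not quoted from the patch:

static inline bool __latency_less(struct rb_node *a, const struct rb_node *b)
{
	struct sched_entity *se_a = rb_entry(a, struct sched_entity, latency_node);
	struct sched_entity *se_b = rb_entry(b, struct sched_entity, latency_node);

	/* Assumed key: vruntime adjusted by the (negative) latency_offset */
	return (s64)((se_a->vruntime + se_a->latency_offset) -
		     (se_b->vruntime + se_b->latency_offset)) < 0;
}

static struct sched_entity *__pick_first_latency(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->latency_timeline);

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, latency_node);
}

With such an ordering the most latency sensitive runnable entity sorts
leftmost, so pick_next_entity() only has to compare the tree's first
element against its default pick via wakeup_preempt_entity().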
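
The removal counterpart is also not quoted above; again a sketch of the
obvious shape rather than the actual patch:

static void __dequeue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* The entity may legitimately not be in the tree, see __enqueue_latency() */
	if (RB_EMPTY_NODE(&se->latency_node))
		return;

	rb_erase_cached(&se->latency_node, &cfs_rq->latency_timeline);
	RB_CLEAR_NODE(&se->latency_node);
}

The RB_EMPTY_NODE() check matters because __enqueue_latency()
deliberately skips insertion for entities that are not latency
sensitive or that already ran for more than sysctl_sched_min_granularity.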