On Mon, 27 Feb 2023 at 14:29, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> > On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> > >
> > > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > > +{
> > > > +
> > > > +	/* Only latency sensitive entity can be added to the list */
> > > > +	if (se->latency_offset >= 0)
> > > > +		return;
> > > > +
> > > > +	if (!RB_EMPTY_NODE(&se->latency_node))
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * An execution time less than sysctl_sched_min_granularity means that
> > > > +	 * the entity has been preempted by a higher sched class or an entity
> > > > +	 * with higher latency constraint.
> > > > +	 * Put it back in the list so it gets a chance to run 1st during the
> > > > +	 * next slice.
> > > > +	 */
> > > > +	if (!(flags & ENQUEUE_WAKEUP)) {
> > > > +		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > > +
> > > > +		if (delta_exec >= sysctl_sched_min_granularity)
> > > > +			return;
> > > > +	}
> > >
> > > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > > rather hard to interpret the below addition to pick_next_entity().
> > >
> > > Let me think about this more... at the very least the comment with
> > > __pick_first_latency() use below needs to be expanded upon if we keep it
> > > like so.
> >
> > Only the waking tasks should be added in the latency rb tree so they
>
> But that's what I'm saying, you can game this by doing super short
> sleeps every min_gran.

Yes, you can, for a while. But I'm not sure this would always be
beneficial in the end, because most of the time you will get your full
slice without playing this game.

>
> > can be selected to run 1st (as long as they don't use too much
> > runtime). But task A can wake up, preempts current task B thanks to
> > its latency nice, starts to run a few usecs but then is immediately
> > preempted by a RT task C as an example. In this case, we consider that
> > the task A didn't get a chance to run after its wakeup and we put it
> > back to the latency rb tree just as if task A had just woken up but
> > didn't preempt the new current task C.
>
> So ideally, and this is where I'm very slow with thinking, that
> wakeup_preempt_entity() condition here:
>
> > > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > >  		se = cfs_rq->last;
> > > >  	}
> > > >
> > > > +	/* Check for latency sensitive entity waiting for running */
> > > > +	latency = __pick_first_latency(cfs_rq);
> > > > +	if (latency && (latency != se) &&
> > > > +	    wakeup_preempt_entity(latency, se) < 1)
> > > > +		se = latency;
>
> should be sufficient to provide fair bandwidth usage. The EEVDF paper
> achieves this by selecting the leftmost eligible task, where eligibility
> is dependent on negative lag. Only those tasks that are behind the pack
> are allowed runtime.
>
> Now clearly our min_vruntime is unsuited for that exact scheme, but iirc
> wake_preempt_entity() does not allow for starvation, so we should be
> good, even without that weird condition in __enqueue_latency(), hmm?

If we unconditionally __enqueue_latency() the task, then it can end up
getting more bandwidth than the others, because it's like having a
larger sleep credit than them.
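
For reference, the bound we are both relying on is the vruntime check in
wakeup_preempt_entity(); a minimal sketch of its mainline form is below
(the series also folds the latency offset into this comparison, which is
left out here):

	static int
	wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
	{
		s64 gran, vdiff = curr->vruntime - se->vruntime;

		/* curr (the latency entity in pick_next_entity()) is behind se */
		if (vdiff <= 0)
			return -1;

		/* curr has pulled more than one wakeup granularity ahead of se */
		gran = wakeup_gran(se);
		if (vdiff > gran)
			return 1;

		return 0;
	}

So a latency entity stops being picked once its vruntime gets more than
one wakeup granularity ahead of the entity pick_next_entity() would
otherwise choose, which is why it cannot starve the others. My point
above is only that staying in the latency rb tree across preemptions
lets it come back and consume that margin more often than a task that
actually has to sleep first.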
The original condition in __enqueue_latency() was:

	if (!(flags & ENQUEUE_WAKEUP))
		return;

so that a task gets a chance to preempt others only at wakeup. But then
I have seen such tasks being preempted immediately by RT tasks and, as a
result, losing their latency advantage. Maybe I should go back to the
condition above and add the weird condition in a separate patch with
dedicated figures.
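
Concretely, the base patch would then go back to something like this
(sketch only; the actual rb tree insertion at the end stays as in the
posted patch):

	static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
	{
		/* Only latency sensitive entities are added to the list */
		if (se->latency_offset >= 0)
			return;

		if (!RB_EMPTY_NODE(&se->latency_node))
			return;

		/* Only a waking task gets a chance to run 1st on the next pick */
		if (!(flags & ENQUEUE_WAKEUP))
			return;

		/* insert se->latency_node into the latency rb tree, as in the posted patch */
	}

and the delta_exec < sysctl_sched_min_granularity re-enqueue would come
on top as its own patch, together with figures for the case of a waking
task being immediately preempted by an RT task.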