Re: [PATCH v10 8/9] sched/fair: Add latency list

On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> >
> > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > +{
> > > +
> > > +     /* Only latency sensitive entity can be added to the list */
> > > +     if (se->latency_offset >= 0)
> > > +             return;
> > > +
> > > +     if (!RB_EMPTY_NODE(&se->latency_node))
> > > +             return;
> > > +
> > > +     /*
> > > +      * An execution time less than sysctl_sched_min_granularity means that
> > > +      * the entity has been preempted by a higher sched class or an entity
> > > +      * with higher latency constraint.
> > > +      * Put it back in the list so it gets a chance to run 1st during the
> > > +      * next slice.
> > > +      */
> > > +     if (!(flags & ENQUEUE_WAKEUP)) {
> > > +             u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > +
> > > +             if (delta_exec >= sysctl_sched_min_granularity)
> > > +                     return;
> > > +     }
> >
> > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > rather hard to interpret the below addition to pick_next_entity().
> >
> > Let me think about this more... at the very least the comment with
> > __pick_first_latency() use below needs to be expanded upon if we keep it
> > like so.
>
> Only the waking tasks should be added to the latency rb tree so they

But that's what I'm saying: you can game this by doing super short
sleeps every min_gran.
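Concretely, an adversarial task only needs something like the following
(an illustrative userspace sketch, not from the patch; the ~3ms min
granularity is just an assumed placeholder):

#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	const uint64_t min_gran_ns = 3000000ULL;	/* assumed min granularity */
	const struct timespec nap = { 0, 1000 };	/* ~1us sleep */

	for (;;) {
		uint64_t start = now_ns();

		/* burn nearly a full min_gran worth of CPU time ... */
		while (now_ns() - start < min_gran_ns)
			;

		/* ... then sleep very briefly, so the next enqueue is a
		 * wakeup and __enqueue_latency() re-adds us to the tree */
		nanosleep(&nap, NULL);
	}
}

This runs close to 100% CPU, yet every enqueue is an ENQUEUE_WAKEUP, so
the task keeps landing back on the latency rb tree as if it had barely
run at all.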

> can be selected to run 1st (as long as they don't use too much
> runtime). But task A can wake up, preempt the current task B thanks to
> its latency nice, start to run for a few usecs, but then be immediately
> preempted by an RT task C, for example. In this case, we consider that
> task A didn't get a chance to run after its wakeup, and we put it
> back in the latency rb tree just as if task A had just woken up but
> hadn't preempted the new current task C.

So ideally, and this is where my thinking is still slow, that
wakeup_preempt_entity() condition here:

> > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > >               se = cfs_rq->last;
> > >       }
> > >
> > > +     /* Check for latency sensitive entity waiting for running */
> > > +     latency = __pick_first_latency(cfs_rq);
> > > +     if (latency && (latency != se) &&
> > > +         wakeup_preempt_entity(latency, se) < 1)
> > > +             se = latency;

should be sufficient to provide fair bandwidth usage. The EEVDF paper
achieves this by selecting the leftmost eligible task, where eligibility
is dependent on negative lag: only those tasks that are behind the pack
are allowed runtime.

Now clearly our min_vruntime is unsuited for that exact scheme, but iirc
wakeup_preempt_entity() does not allow for starvation, so we should be
good, even without that weird condition in __enqueue_latency(), hmm?
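
For reference, that check boils down to roughly the following (the
mainline form; the latency_offset handling this series adds to the
granularity is left out here):

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	/* 'curr' is not ahead of 'se' at all: no reason for 'se' to preempt */
	if (vdiff <= 0)
		return -1;

	/*
	 * 'curr' has run more than a wakeup granularity ahead of 'se':
	 * 'se' should preempt it.
	 */
	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}

In the pick_next_entity() hunk above, the latency entity is passed as
'curr' and the normal pick as 'se', so the "< 1" test only favours the
latency entity while its vruntime is no more than a wakeup granularity
ahead of the leftmost entity's, which is what bounds the extra bandwidth
it can take and why starvation shouldn't build up.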


