Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

Tobias Huschle <huschle@xxxxxxxxxxxxx> · Tue, 30 Apr 2024 12:50:05 +0200

It took me a while, but I was able to figure out why EEVDF behaves 
different then CFS does. I'm still waiting for some official confirmation
of my assumptions but it all seems very plausible to me.

Leaving aside all the specifics of vhost and kworkers, a more general
description of the scenario would be as follows:

Assume that we have two tasks taking turns on a single CPU. 
Task 1 does something and wakes up Task 2.
Task 2 does something and goes to sleep.
And we're just repeating that.
Task 1 and task 2 only run for very short amounts of time, i.e. much 
shorter than a regular time slice (vhost = task1, kworker = task2).

Let's further assume, that task 1 runs longer than task 2. 
In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
always gets picked on wake up because it has the smaller vruntime. 
In EEVDF, this would translate to a permanent positive lag, which also 
causes task 2 to get consistently scheduled on wake up.

Let's now assume, that ocassionally, task 2 runs a little bit longer than
task 1. In CFS, this means, that task 2 can close the vruntime gap by a
bit, but, it can easily remain below the value of task 1. Task 2 would 
still get picked on wake up.
With EEVDF, in its current form, task 2 will now get a negative lag, which
in turn, will cause it not being picked on the next wake up.

So, it seems we have a change in the level of how far the both variants look 
into the past. CFS being willing to take more history into account, whereas
EEVDF does not (with update_entity_lag setting the lag value from scratch, 
and place_entity not taking the original vruntime into account).

All of this can be seen as correct by design, a task consumes more time
than the others, so it has to give way to others. The big difference
is now, that CFS allowed a task to collect some bonus by constantly using 
less CPU time than others and trading that time against ocassionally taking
more CPU time. EEVDF could do the same thing, by allowing the accumulation
of positive lag, which can then be traded against the one time the task
would get negative lag. This might clash with other EEVDF assumptions though.

The patch below fixes the degredation, but is not at all aligned with what 
EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
correct.

So, what does this now mean for the vhost regression we were discussing?

1. The behavior of the scheduler changed with regard to wake-up scenarios.
2. vhost in its current form relies on the way how CFS works by assuming 
   that the kworker always gets scheduled.

I would like to argue that it therefore makes sense to reconsider the vhost
implementation to make it less dependent on the internals of the scheduler.
As proposed earlier in this thread, I see two options:

1. Do an explicit schedule() after every iteration across the vhost queues
2. Set the need_resched flag after writing to the socket that would trigger
   eventfd and the underlying kworker

Both options would make sure that the vhost gives up the CPU as it cannot
continue anyway without the kworker handling the event. Option 1 will give
up the CPU regardless of whether something was found in the queues, whereas
option 2 would only give up the CPU if there is.

It shall be noted, that we encountered similar behavior when running some
fio benchmarks. From a brief glance at the code, I was seeing similar
intentions: Loop over queues, then trigger an action through some event
mechanism. Applying the same patch as mentioned above also fixes this issue.

It could be argued, that this is still something that needs to be somehow
addressed by the scheduler since it might affect others as well and there 
are in fact patches coming in. Will they address our issue here? Not sure yet.
On the other hand, it might just be beneficial to make vhost more resilient
towards the scheduler's algorithm by not relying on a certain behavior in
the wakeup path.
Further discussion on additional commits to make EEVDF work correctly can 
be found here: 
https://lore.kernel.org/lkml/20240408090639.GD21904@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
So far these patches do not fix the degredation.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
        s64 lag, limit;
 
        SCHED_WARN_ON(!se->on_rq);
-       lag = avg_vruntime(cfs_rq) - se->vruntime;
+       lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
        limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
        se->vlag = clamp(lag, -limit, limit);