It took me a while, but I was able to figure out why EEVDF behaves different then CFS does. I'm still waiting for some official confirmation of my assumptions but it all seems very plausible to me. Leaving aside all the specifics of vhost and kworkers, a more general description of the scenario would be as follows: Assume that we have two tasks taking turns on a single CPU. Task 1 does something and wakes up Task 2. Task 2 does something and goes to sleep. And we're just repeating that. Task 1 and task 2 only run for very short amounts of time, i.e. much shorter than a regular time slice (vhost = task1, kworker = task2). Let's further assume, that task 1 runs longer than task 2. In CFS, this means, that vruntime of task 1 starts to outrun the vruntime of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2 always gets picked on wake up because it has the smaller vruntime. In EEVDF, this would translate to a permanent positive lag, which also causes task 2 to get consistently scheduled on wake up. Let's now assume, that ocassionally, task 2 runs a little bit longer than task 1. In CFS, this means, that task 2 can close the vruntime gap by a bit, but, it can easily remain below the value of task 1. Task 2 would still get picked on wake up. With EEVDF, in its current form, task 2 will now get a negative lag, which in turn, will cause it not being picked on the next wake up. So, it seems we have a change in the level of how far the both variants look into the past. CFS being willing to take more history into account, whereas EEVDF does not (with update_entity_lag setting the lag value from scratch, and place_entity not taking the original vruntime into account). All of this can be seen as correct by design, a task consumes more time than the others, so it has to give way to others. The big difference is now, that CFS allowed a task to collect some bonus by constantly using less CPU time than others and trading that time against ocassionally taking more CPU time. EEVDF could do the same thing, by allowing the accumulation of positive lag, which can then be traded against the one time the task would get negative lag. This might clash with other EEVDF assumptions though. The patch below fixes the degredation, but is not at all aligned with what EEVDF wants to achieve, but it helps as an indicator that my hypothesis is correct. So, what does this now mean for the vhost regression we were discussing? 1. The behavior of the scheduler changed with regard to wake-up scenarios. 2. vhost in its current form relies on the way how CFS works by assuming that the kworker always gets scheduled. I would like to argue that it therefore makes sense to reconsider the vhost implementation to make it less dependent on the internals of the scheduler. As proposed earlier in this thread, I see two options: 1. Do an explicit schedule() after every iteration across the vhost queues 2. Set the need_resched flag after writing to the socket that would trigger eventfd and the underlying kworker Both options would make sure that the vhost gives up the CPU as it cannot continue anyway without the kworker handling the event. Option 1 will give up the CPU regardless of whether something was found in the queues, whereas option 2 would only give up the CPU if there is. It shall be noted, that we encountered similar behavior when running some fio benchmarks. From a brief glance at the code, I was seeing similar intentions: Loop over queues, then trigger an action through some event mechanism. Applying the same patch as mentioned above also fixes this issue. It could be argued, that this is still something that needs to be somehow addressed by the scheduler since it might affect others as well and there are in fact patches coming in. Will they address our issue here? Not sure yet. On the other hand, it might just be beneficial to make vhost more resilient towards the scheduler's algorithm by not relying on a certain behavior in the wakeup path. Further discussion on additional commits to make EEVDF work correctly can be found here: https://lore.kernel.org/lkml/20240408090639.GD21904@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/ So far these patches do not fix the degredation. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 03be0d1330a6..b83a72311d2a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) s64 lag, limit; SCHED_WARN_ON(!se->on_rq); - lag = avg_vruntime(cfs_rq) - se->vruntime; + lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime; limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); se->vlag = clamp(lag, -limit, limit);