On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> >
> > -------- Summary --------
> >
> > In my (non-vhost experience) opinion the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if a data transfer was successfully
> > initiated. It will be necessary to check if this causes problems with
> > other workloads/benchmarks.
>
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice
> to others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept. But, in this
particular scenario, we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to the woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag
1. vhost initiates data transfer
2. kworker is woken up
3a. EEVDF does not schedule the kworker if it has negative lag
4a. vhost continues running, kworker on the same CPU starves
-- or --
3b. EEVDF schedules the kworker if it has positive or no lag
4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the necessary
need_resched flag; the flag can only be set by another CPU now. The
scheduling of the kworker was not caused by cond_resched, but rather by
the wakeup path of the scheduler. (See the simplified eligibility
sketch at the end of this mail for an illustration of the 3a case.)

cond_resched works successfully once the load balancer (I suppose)
decides to migrate vhost off to another CPU. In that case, the load
balancer on another CPU sets that flag and we are good. That then
eventually allows the scheduler to pick the kworker, but very late.

> I propose a two patch series then:
>
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
>
>   If you're doing longer computations: first think userspace. If you
>   **really** want to do it in kernel you should regularly check if you
>   need to give up the CPU (remember there is cooperative multitasking
>   per CPU). Idiom::
>
>       cond_resched(); /* Will sleep */
>
> replace cond_resched -> schedule
>
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
>
> patch 2: make this change for vhost.
>
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone
else from the scheduler maintainers).

For patch 2, I would prefer to do some more testing first to see whether
this has a negative effect on other benchmarks. I also stumbled upon
something in the scheduler code that I want to verify. Maybe a cgroup
thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix (rough
sketches of both variants from the summary follow below), check the
cgroup thing, and wait for Peter then. Will get back if any of the above
yields some results.
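
To illustrate the 3a case: simplified, EEVDF only picks an entity once
it is eligible, i.e. once its lag is non-negative. The snippet below is
an illustrative sketch only -- the struct and function names are made
up, not the kernel's; the real check lives in kernel/sched/fair.c
(entity_eligible()):

    /* Illustrative sketch, NOT actual kernel code. */
    struct entity { long long vruntime; };
    struct queue  { long long avg_vruntime; };  /* weighted average */

    static bool eligible(const struct entity *se, const struct queue *q)
    {
            /*
             * lag = avg_vruntime - vruntime. A kworker that wakes up
             * with negative lag (vruntime ahead of the average) is not
             * eligible and gets skipped, so vhost keeps running on
             * that CPU even though the kworker was just woken.
             */
            return se->vruntime <= q->avg_vruntime;
    }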
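
The cond_resched->schedule variant I intend to test is essentially the
following untested sketch against the work loop in drivers/vhost/vhost.c
(abbreviated; the surrounding lines differ between kernel versions):

    	llist_for_each_entry_safe(work, work_next, node, node) {
    		clear_bit(VHOST_WORK_QUEUED, &work->flags);
    		work->fn(work);
    -		cond_resched();
    +		schedule();
    	}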
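
The alternative from the summary, setting need_resched explicitly, would
look roughly like this. Where exactly vhost can tell that a transfer was
successfully initiated is still open, so transfer_initiated below is a
purely hypothetical placeholder:

    /* Hypothetical sketch: run after vhost has kicked off a data
     * transfer. transfer_initiated is a made-up placeholder; only
     * set_tsk_need_resched() is a real kernel helper. */
    if (transfer_initiated)
            set_tsk_need_resched(current);

Either way, both variants need the benchmark runs mentioned above before
they are more than an experiment.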