On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
>
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?

As a baseline, I verified that the following two options fix the
regression (a sketch of both can be found at the end of this mail):

- replacing the cond_resched in the vhost_worker function with a
  hard schedule
- setting the need_resched flag using set_tsk_need_resched(current)
  right before calling cond_resched

I then tried to find a better spot to put the set_tsk_need_resched
call.

One approach I found to work is setting the need_resched flag at the
end of handle_tx and handle_rx. At that point, the data has actually
been passed to the socket, so the originally blocked kworker has
something to do and will profit from the reschedule (also sketched
at the end of this mail). It might be possible to go deeper and
place the set_tsk_need_resched call right after the data is actually
passed, but that would mean sprinkling the call across multiple
places and might be too intrusive. Furthermore, it might be possible
to check whether an error occurred while preparing the transmission
and skip setting the flag in that case. This would require a
conceptual decision on the vhost side. This solution would not touch
the scheduler, only incentivise it to do the right thing for this
particular regression.

Another idea could be to find the counterpart that initiates the
actual data transfer, which I assume wakes up the kworker. From what
I gather, it seems to be an eventfd notification that ends up
somewhere in the qemu code. I am not sure whether that context would
allow setting the need_resched flag, nor whether doing so would be a
good idea.

>
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
>
> Or this last one.

In cond_resched itself, this will probably only be possible in a
very, very hacky way. The necessary data is currently not
immediately accessible there, so making it available would bloat up
the cond_resched function quite a bit and add a probably
non-negligible amount of overhead.

Changing other aspects of the scheduler might get us into trouble,
as they would all resolve back to the question: "What is the magic
value that determines whether a small task not being scheduled
justifies setting the need_resched flag for a currently running task
or adjusting its lag?" As this would then also have to work for all
non-vhost related cases, it looks like a dangerous path to me on
second thought.

-------- Summary --------

In my opinion (I have no vhost experience), the way to go would be
to either replace the cond_resched with a hard schedule or to set
the need_resched flag within vhost if a data transfer was
successfully initiated. It will be necessary to check whether this
causes problems with other workloads/benchmarks.
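
For reference, below is a sketch of the two baseline options, shown
against a simplified version of the work-item loop in vhost_worker()
(drivers/vhost/vhost.c). This is illustrative context only, not
verbatim upstream code:

	/*
	 * Simplified vhost_worker() work loop; just enough context
	 * to show where the two baseline changes would go.
	 */
	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);
		work->fn(work);

		/*
		 * Option 1: replace the cond_resched() below with a
		 * hard schedule:
		 *
		 *	schedule();
		 *
		 * Option 2: keep the cond_resched(), but raise the
		 * need_resched flag first so that cond_resched()
		 * actually reschedules and the woken kworker gets
		 * to run.
		 */
		set_tsk_need_resched(current);
		cond_resched();
	}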
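
The handle_tx variant mentioned above would look roughly like the
following (handle_rx analogous); again just a sketch, with the
function body elided:

	static void handle_tx(struct vhost_net *net)
	{
		...

		/*
		 * The data has been passed to the socket at this
		 * point, so the originally blocked kworker has
		 * something to do and profits from the reschedule
		 * triggered by the next cond_resched(). Whether to
		 * skip this when preparing the transmission failed
		 * is the vhost-side conceptual decision mentioned
		 * above.
		 */
		set_tsk_need_resched(current);
	}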