Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Fri, 15 Mar 2024 06:31:51 -0400

On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > 
> > Thanks a lot! To clarify, it is not that I am opposed to changing vhost.
> > However, I would like some documentation to exist saying that if you
> > do abc, then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure). Right now we are going by the documentation, and that says
> > cond_resched, so we do that.
> > 
> > -- 
> > MST
> > 
> 
> Here I'd like to add that we have two different problems:
> 
> 1. cond_resched not working as expected
>    This appears to me to be a bug in the scheduler where it lets the cgroup, 
>    which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
>    is allowed to surpass its own deadline without consequences. One of my RFCs
>    mentioned above addresses this issue (not happy yet with the implementation).
>    This issue only appears in that specific scenario, so it's not a general 
>    issue, rather a corner case.
>    But, this fix will still allow the vhost to reach its deadline, which is
>    one full time slice. This brings down the max delays from 300+ms to whatever
>    the timeslice is. This is not enough to fix the regression.
> 
> 2. vhost relying on kworker being scheduled on wake up
>    This is the bigger issue for the regression. There are rare cases where
>    the vhost runs only for a very short amount of time before it wakes up
>    the kworker. At the same time, the kworker takes longer than usual to
>    complete its work, and longer than the vhost ran before. We are
>    talking about 4-digit to low 5-digit nanosecond values.
>    With those two being the only tasks on the CPU, the scheduler now assumes
>    that the kworker wants to unfairly consume more than the vhost and denies
>    it being scheduled on wakeup.
>    In the regular cases, the kworker is faster than the vhost, so the 
>    scheduler assumes that the kworker needs help, which benefits the
>    scenario we are looking at.
>    In the bad case, this unfortunately means that cond_resched cannot work
>    as well as before, for this particular case!
>    So, let's assume that problem 1 from above is fixed. It will take one
>    full time slice for the scheduler to set the need_resched flag,
>    because only then does the vhost surpass its deadline. Until then, the
>    scheduler cannot know that the kworker should actually run. The kworker
>    is unable to communicate that itself since it's not getting scheduled,
>    and there is no external entity that could intervene.
>    Hence my argument that cond_resched still works as expected. The
>    crucial part is that the wake up behavior has changed, which is why I'm
>    a bit reluctant to propose a documentation change on cond_resched.
>    I could see proposing a doc change saying that cond_resched should not
>    be used if a task heavily relies on a woken up task being scheduled.

Could you remind me, please, what specifically the kworker is doing that
vhost is relying on?

-- 
MST