On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> >
> > Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> > I would like however for some documentation to exist saying that if you
> > do abc then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure). Right now we are going by the documentation and that says
> > cond_resched so we do that.
> >
> > --
> > MST
> >
>
> Here I'd like to add that we have two different problems:
>
> 1. cond_resched not working as expected
> This appears to me to be a bug in the scheduler where it lets the cgroup,
> which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
> is allowed to surpass its own deadline without consequences. One of my RFCs
> mentioned above addresses this issue (not happy yet with the implementation).
> This issue only appears in that specific scenario, so it's not a general
> issue, rather a corner case.
> But, this fix will still allow the vhost to reach its deadline, which is
> one full time slice. This brings down the max delays from 300+ms to whatever
> the timeslice is. This is not enough to fix the regression.
>
> 2. vhost relying on kworker being scheduled on wake up
> This is the bigger issue for the regression. There are rare cases, where
> the vhost runs only for a very short amount of time before it wakes up
> the kworker. Simultaneously, the kworker takes longer than usual to
> complete its work and takes longer than the vhost did before. We
> are talking 4-digit to low 5-digit nanosecond values.
> With those two being the only tasks on the CPU, the scheduler now assumes
> that the kworker wants to unfairly consume more than the vhost and denies
> it being scheduled on wakeup.
> In the regular cases, the kworker is faster than the vhost, so the
> scheduler assumes that the kworker needs help, which benefits the
> scenario we are looking at.
> In the bad case, this means unfortunately, that cond_resched cannot work
> as good as before, for this particular case!
> So, let's assume that problem 1 from above is fixed. It will take one
> full time slice to get the need_resched flag set by the scheduler
> because vhost surpasses its deadline. Before, the scheduler cannot know
> that the kworker should actually run. The kworker itself is unable
> to communicate that by itself since it's not getting scheduled and there
> is no external entity that could intervene.
> Hence my argumentation that cond_resched still works as expected. The
> crucial part is that the wake up behavior has changed which is why I'm
> a bit reluctant to propose a documentation change on cond_resched.
> I could see proposing a doc change, that cond_resched should not be
> used if a task heavily relies on a woken up task being scheduled.

Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?

--
MST