On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
>
> Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> I would like however for some documentation to exist saying that if you
> do abc then call API xyz. Then I hope we can feel a bit safer that
> future scheduler changes will not break vhost (though as usual, nothing
> is for sure). Right now we are going by the documentation and that says
> cond_resched so we do that.
>
> --
> MST
>

Here I'd like to add that we have two different problems:

1. cond_resched not working as expected

This appears to me to be a bug in the scheduler where it lets the
cgroup, which the vhost is running in, loop endlessly. In EEVDF terms,
the cgroup is allowed to surpass its own deadline without consequences.
One of my RFCs mentioned above addresses this issue (I am not happy with
the implementation yet). This issue only appears in that specific
scenario, so it is a corner case rather than a general issue.
But this fix will still allow the vhost to reach its deadline, which is
one full time slice. This brings the max delays down from 300+ms to
whatever the time slice is. This is not enough to fix the regression.

2. vhost relying on kworker being scheduled on wake up

This is the bigger issue for the regression. There are rare cases where
the vhost runs only for a very short amount of time before it wakes up
the kworker. At the same time, the kworker takes longer than usual to
complete its work, and longer than the vhost ran before. We are talking
about four-digit to low five-digit nanosecond values.
With those two being the only tasks on the CPU, the scheduler now
assumes that the kworker wants to unfairly consume more than the vhost
and denies it being scheduled on wakeup.
In the regular cases, the kworker is faster than the vhost, so the
scheduler assumes that the kworker needs help, which benefits the
scenario we are looking at.

In the bad case, this unfortunately means that cond_resched cannot work
as well as before, for this particular case!
So, let's assume that problem 1 from above is fixed. It will take one
full time slice for the scheduler to set the need_resched flag, because
only then does the vhost surpass its deadline. Before that point, the
scheduler cannot know that the kworker should actually run. The kworker
is unable to communicate that itself since it is not getting scheduled,
and there is no external entity that could intervene.
Hence my argument that cond_resched still works as expected. The
crucial part is that the wake up behavior has changed, which is why I'm
a bit reluctant to propose a documentation change on cond_resched.
I could see proposing a doc change saying that cond_resched should not
be used if a task heavily relies on a woken up task being scheduled.
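
As a rough illustration of the wakeup behavior described above, here is a
toy Python model. It is only a sketch under strong assumptions, not kernel
code: two equally weighted tasks on one CPU, eligibility reduced to "lag is
non-negative", and a waker that only yields through cond_resched once its
slice has expired. The names Task, is_eligible and wakeup_delay_ns, as well
as the 3 ms slice, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    vruntime_ns: int  # runtime consumed so far (equal weights assumed)


def avg_vruntime(tasks):
    """Plain average; the real scheduler uses a load-weighted average."""
    return sum(t.vruntime_ns for t in tasks) / len(tasks)


def is_eligible(task, tasks):
    """Toy eligibility check: lag = average - own vruntime must be >= 0."""
    return avg_vruntime(tasks) - task.vruntime_ns >= 0


def wakeup_delay_ns(waker, wakee, waker_slice_left_ns):
    """Time the wakee waits before it runs, in this simplified model.

    An eligible wakee preempts the waker right away. An ineligible wakee
    has to wait until the waker has used up the rest of its slice, because
    only then is need_resched set and cond_resched actually yields.
    """
    if is_eligible(wakee, [waker, wakee]):
        return 0
    return waker_slice_left_ns


if __name__ == "__main__":
    SLICE_NS = 3_000_000  # hypothetical time slice, for illustration only

    # Regular case: the kworker consumed less than the vhost.
    print(wakeup_delay_ns(Task("vhost", 8_000), Task("kworker", 5_000), SLICE_NS))

    # Bad case: the kworker consumed more than the vhost.
    print(wakeup_delay_ns(Task("vhost", 8_000), Task("kworker", 12_000), SLICE_NS))
```

In this model the regular case yields no delay, while the bad case delays
the kworker by the remaining slice of the vhost, which matches the argument
that fixing problem 1 alone bounds the delay by one time slice but does not
remove it.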