On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> >
> > -------- Summary --------
> >
> > In my (non-vhost experience) opinion the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if a data transfer was successfully
> > initiated. It will be necessary to check if this causes problems with
> > other workloads/benchmarks.
>
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice
> to others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept. But, in this
particular scenario, we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to the woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag
1. vhost initiates data transfer
2. kworker is woken up
3a. EEVDF does not schedule the kworker if it has negative lag
4a. vhost continues running, kworker on the same CPU starves
-- or --
3b. EEVDF schedules the kworker if it has positive or no lag
4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the necessary
need_resched flag; the flag can only be set by another CPU now. The
scheduling of the kworker was not caused by cond_resched, but rather by
the wakeup path of the scheduler. (See the simplified eligibility
sketch at the end of this mail for an illustration of the 3a case.)

cond_resched works successfully once the load balancer (I suppose)
decides to migrate vhost off to another CPU. In that case, the load
balancer on another CPU sets that flag and we are good. That then
eventually allows the scheduler to pick the kworker, but very late.

> I propose a two patch series then:
>
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
>
>   If you're doing longer computations: first think userspace. If you
>   **really** want to do it in kernel you should regularly check if you
>   need to give up the CPU (remember there is cooperative multitasking
>   per CPU). Idiom::
>
>       cond_resched(); /* Will sleep */
>
> replace cond_resched -> schedule
>
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
>
> patch 2: make this change for vhost.
>
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone
else from the scheduler maintainers).

For patch 2, I would prefer to do some more testing first to see whether
this has a negative effect on other benchmarks. I also stumbled upon
something in the scheduler code that I want to verify. Maybe a cgroup
thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix (rough
sketches of both variants from the summary follow below), check the
cgroup thing, and wait for Peter then. Will get back if any of the above
yields some results.
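
To illustrate the 3a case: simplified, EEVDF only picks an entity once
it is eligible, i.e. once its lag is non-negative. The snippet below is
an illustrative sketch only -- the struct and function names are made
up, not the kernel's; the real check lives in kernel/sched/fair.c
(entity_eligible()):

    /* Illustrative sketch, NOT actual kernel code. */
    struct entity { long long vruntime; };
    struct queue  { long long avg_vruntime; };  /* weighted average */

    static bool eligible(const struct entity *se, const struct queue *q)
    {
            /*
             * lag = avg_vruntime - vruntime. A kworker that wakes up
             * with negative lag (vruntime ahead of the average) is not
             * eligible and gets skipped, so vhost keeps running on
             * that CPU even though the kworker was just woken.
             */
            return se->vruntime <= q->avg_vruntime;
    }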
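
The cond_resched->schedule variant I intend to test is essentially the
following untested sketch against the work loop in drivers/vhost/vhost.c
(abbreviated; the surrounding lines differ between kernel versions):

    	llist_for_each_entry_safe(work, work_next, node, node) {
    		clear_bit(VHOST_WORK_QUEUED, &work->flags);
    		work->fn(work);
    -		cond_resched();
    +		schedule();
    	}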
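
The alternative from the summary, setting need_resched explicitly, would
look roughly like this. Where exactly vhost can tell that a transfer was
successfully initiated is still open, so transfer_initiated below is a
purely hypothetical placeholder:

    /* Hypothetical sketch: run after vhost has kicked off a data
     * transfer. transfer_initiated is a made-up placeholder; only
     * set_tsk_need_resched() is a real kernel helper. */
    if (transfer_initiated)
            set_tsk_need_resched(current);

Either way, both variants need the benchmark runs mentioned above before
they are more than an experiment.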