On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > - Along with the wakeup of the kworker, need_resched needs to
> > >   be set, such that cond_resched() triggers a reschedule.
> >
> > Let's try this? Does not look like discussing vhost itself will
> > draw attention from scheduler guys but posting a scheduling
> > patch probably will? Can you post a patch?
>
> As a baseline, I verified that the following two options fix
> the regression:
>
> - replacing the cond_resched in the vhost_worker function with a hard
>   schedule
> - setting the need_resched flag using set_tsk_need_resched(current)
>   right before calling cond_resched
>
> I then tried to find a better spot to put the set_tsk_need_resched
> call.
>
> One approach I found to be working is setting the need_resched flag
> at the end of handle_tx and handle_rx.
> This would be after data has actually been passed to the socket, so
> the originally blocked kworker has something to do and will profit
> from the reschedule.
> It might be possible to go deeper and place the set_tsk_need_resched
> call at the location right after actually passing the data, but this
> might leave us with sprinkling that call in multiple places and
> might be too intrusive.
> Furthermore, it might be possible to check if an error occurred when
> preparing the transmission and then skip setting the flag.
>
> This would require a conceptual decision on the vhost side.
> This solution would not touch the scheduler, only incentivise it to
> do the right thing for this particular regression.
>
> Another idea could be to find the counterpart that initiates the
> actual data transfer, which I assume wakes up the kworker. From
> what I gather it seems to be an eventfd notification that ends up
> somewhere in the qemu code. Not sure if that context would allow
> setting the need_resched flag, nor whether this would be a good idea.
>
> > > - On cond_resched(), verify if the consumed runtime of the caller
> > >   is outweighing the negative lag of another process (e.g. the
> > >   kworker) and schedule the other process. Introduces overhead
> > >   to cond_resched.
> >
> > Or this last one.
>
> On cond_resched itself, this will probably only be possible in a very,
> very hacky way. That is because currently there is no immediate access
> to the necessary data available, which would make it necessary to
> bloat up the cond_resched function quite a bit, with a probably
> non-negligible amount of overhead.
>
> Changing other aspects of the scheduler might get us in trouble, as
> they would all probably resolve back to the question "What is the magic
> value that determines whether a small task not being scheduled justifies
> setting the need_resched flag for a currently running task or adjusting
> its lag?". As this would then also have to work for all non-vhost related
> cases, this looks like a dangerous path to me on second thought.
>
> -------- Summary --------
>
> In my (non-vhost experience) opinion the way to go would be either
> replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if a data transfer was successfully
> initiated. It will be necessary to check if this causes problems with
> other workloads/benchmarks.

Yes, but conceptually I am still in the dark on whether the fact that
periodically invoking cond_resched is no longer sufficient to be nice
to others is a bug, or intentional. So you feel it is intentional?

I propose a two patch series then:

patch 1: in this text in Documentation/kernel-hacking/hacking.rst

    If you're doing longer computations: first think userspace.
    If you **really** want to do it in kernel you should regularly
    check if you need to give up the CPU (remember there is
    cooperative multitasking per CPU). Idiom::

        cond_resched(); /* Will sleep */

replace cond_resched -> schedule

Since apparently cond_resched is no longer sufficient to make the
scheduler check whether you need to give up the CPU.

patch 2: make this change for vhost.

WDYT?

--
MST
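[Editor's note: for concreteness, the two-patch series proposed above might look roughly like the following. This is a sketch only: the hunk contexts are paraphrased rather than taken from a specific tree, and the exact cond_resched() call site in drivers/vhost/vhost.c varies across kernel versions.]

```diff
 Patch 1 (sketch) -- Documentation/kernel-hacking/hacking.rst,
 in the "longer computations" idiom:

-    cond_resched(); /* Will sleep */
+    schedule(); /* Will sleep */

 Patch 2 (sketch) -- drivers/vhost/vhost.c, in the vhost_worker()
 work-processing loop, after each work item has run:

 		work->fn(work);
-		cond_resched();
+		schedule();
```

The alternative baseline fix mentioned earlier in the thread would instead keep cond_resched() in patch 2 and precede it with set_tsk_need_resched(current), forcing the reschedule check to fire.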