On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
>
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?

As a baseline, I verified that the following two options fix the
regression (a sketch of both can be found at the end of this mail):

- replacing the cond_resched in the vhost_worker function with a
  hard schedule
- setting the need_resched flag using set_tsk_need_resched(current)
  right before calling cond_resched

I then tried to find a better spot to put the set_tsk_need_resched
call.

One approach I found to work is setting the need_resched flag at the
end of handle_tx and handle_rx. At that point, the data has actually
been passed to the socket, so the originally blocked kworker has
something to do and will profit from the reschedule (also sketched
at the end of this mail). It might be possible to go deeper and
place the set_tsk_need_resched call right after the data is actually
passed, but that would mean sprinkling the call across multiple
places and might be too intrusive. Furthermore, it might be possible
to check whether an error occurred while preparing the transmission
and skip setting the flag in that case. This would require a
conceptual decision on the vhost side. This solution would not touch
the scheduler, only incentivise it to do the right thing for this
particular regression.

Another idea could be to find the counterpart that initiates the
actual data transfer, which I assume wakes up the kworker. From what
I gather, it seems to be an eventfd notification that ends up
somewhere in the qemu code. I am not sure whether that context would
allow setting the need_resched flag, nor whether doing so would be a
good idea.

>
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
>
> Or this last one.

In cond_resched itself, this will probably only be possible in a
very, very hacky way. The necessary data is currently not
immediately accessible there, so making it available would bloat up
the cond_resched function quite a bit and add a probably
non-negligible amount of overhead.

Changing other aspects of the scheduler might get us into trouble,
as they would all resolve back to the question: "What is the magic
value that determines whether a small task not being scheduled
justifies setting the need_resched flag for a currently running task
or adjusting its lag?" As this would then also have to work for all
non-vhost related cases, it looks like a dangerous path to me on
second thought.

-------- Summary --------

In my opinion (I have no vhost experience), the way to go would be
to either replace the cond_resched with a hard schedule or to set
the need_resched flag within vhost if a data transfer was
successfully initiated. It will be necessary to check whether this
causes problems with other workloads/benchmarks.
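
For reference, below is a sketch of the two baseline options, shown
against a simplified version of the work-item loop in vhost_worker()
(drivers/vhost/vhost.c). This is illustrative context only, not
verbatim upstream code:

	/*
	 * Simplified vhost_worker() work loop; just enough context
	 * to show where the two baseline changes would go.
	 */
	llist_for_each_entry_safe(work, work_next, node, node) {
		clear_bit(VHOST_WORK_QUEUED, &work->flags);
		work->fn(work);

		/*
		 * Option 1: replace the cond_resched() below with a
		 * hard schedule:
		 *
		 *	schedule();
		 *
		 * Option 2: keep the cond_resched(), but raise the
		 * need_resched flag first so that cond_resched()
		 * actually reschedules and the woken kworker gets
		 * to run.
		 */
		set_tsk_need_resched(current);
		cond_resched();
	}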
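
The handle_tx variant mentioned above would look roughly like the
following (handle_rx analogous); again just a sketch, with the
function body elided:

	static void handle_tx(struct vhost_net *net)
	{
		...

		/*
		 * The data has been passed to the socket at this
		 * point, so the originally blocked kworker has
		 * something to do and profits from the reschedule
		 * triggered by the next cond_resched(). Whether to
		 * skip this when preparing the transmission failed
		 * is the vhost-side conceptual decision mentioned
		 * above.
		 */
		set_tsk_need_resched(current);
	}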