On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
>

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
  regression of ~50% in a uperf testcase.
- Tracing the scheduler showed that it takes decisions which are in
  line with its design.
- The traces also showed that a vhost instance might run excessively
  long on its CPU in some circumstances. These long runs cause the
  performance regression, as they lead to delays of 100+ms for a
  kworker which drives the actual network processing.
- Before EEVDF, the vhost would always be scheduled off its CPU in
  favor of the kworker, as the kworker was being woken up and the
  former scheduler gave more priority to the woken-up task. With
  EEVDF, the kworker, as a long-running process, is able to accumulate
  negative lag, which causes EEVDF to not prefer it on its wakeup,
  leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
  continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling
  cond_resched(), but the need_resched flag is not set, therefore
  cond_resched() does nothing.

To solve this, I see the following options (the list might be neither
complete nor correct):

- Along with the wakeup of the kworker, need_resched needs to be set,
  such that cond_resched() triggers a reschedule.
- The vhost calls schedule() instead of cond_resched() to give up the
  CPU. This would of course be a significantly stricter approach and
  might limit the performance of vhost in other cases.
- Preventing the kworker from accumulating negative lag, as it is
  mostly not runnable and, if it runs, it only runs for a very short
  time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), verify whether the consumed runtime of the caller
  outweighs the negative lag of another process (e.g. the kworker)
  and, if so, schedule the other process. This introduces overhead to
  cond_resched().

I would be curious about feedback on those ideas and am interested in
alternative approaches.
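
To illustrate the cond_resched() behaviour described in the recap, here
is a toy userspace model. It is only a sketch and not kernel code: all
names (toy_task, toy_cond_resched, toy_worker_loop) are hypothetical,
and the single boolean merely stands in for the need_resched flag. It
assumes that a cond_resched()-style call yields only when a reschedule
has already been requested, which is the behaviour observed in the
traces.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */
struct toy_task {
    bool need_resched;   /* stands in for the need_resched flag */
    int  iterations_run; /* loop iterations completed so far */
};

/*
 * Models a cond_resched()-style call: it yields only if a reschedule
 * was already requested; otherwise it returns right away.
 */
bool toy_cond_resched(struct toy_task *t)
{
    if (!t->need_resched)
        return false;        /* nothing pending, caller keeps the CPU */
    t->need_resched = false; /* the real code would enter the scheduler */
    return true;
}

/*
 * Models a vhost-like worker loop with a budget of iterations. Another
 * task is "woken up" at iteration wake_at. wakeup_sets_flag selects
 * whether that wakeup also requests a reschedule of the running task.
 * Returns the number of iterations run before the worker yields.
 */
int toy_worker_loop(struct toy_task *t, int budget, int wake_at,
                    bool wakeup_sets_flag)
{
    assert(budget >= 0);
    t->iterations_run = 0;
    for (int i = 0; i < budget; i++) {
        if (i == wake_at && wakeup_sets_flag)
            t->need_resched = true;
        t->iterations_run = i + 1;
        if (toy_cond_resched(t))
            break;           /* worker gives up the CPU here */
    }
    return t->iterations_run;
}
```

In this model, a wakeup that does not set the flag lets the worker
consume its whole budget, while a wakeup that sets the flag makes the
next toy_cond_resched() call yield. That corresponds to the first
option in the list above.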