Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 3/11/24 17:05, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
>> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
>>> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
>>>> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
>>>>>> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>>>>
>>>> -------- Summary --------
>>>>
>>>> In my opinion (coming from outside vhost), the way to go would be
>>>> either replacing the cond_resched with a hard schedule or setting
>>>> the need_resched flag within vhost if a data transfer was
>>>> successfully initiated. It will be necessary to check whether this
>>>> causes problems with other workloads/benchmarks.
>>>
>>> Yes, but conceptually I am still in the dark on whether it is a bug,
>>> or intentional, that periodically invoking cond_resched is no longer
>>> sufficient to be nice to others.  So you feel it is intentional?
>>
>> I would assume that cond_resched is still a valid concept.
>> But, in this particular scenario we have the following problem:
>>
>> So far (with CFS) we had:
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3. CFS gives priority to woken up task and schedules it
>> 4. kworker runs
>>
>> Now (with EEVDF) we have:
>> 0. In some cases, kworker has accumulated negative lag 
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3a. EEVDF does not schedule kworker if it has negative lag
>> 4a. vhost continues running, kworker on same CPU starves
>> --
>> 3b. EEVDF schedules kworker if it has positive or no lag
>> 4b. kworker runs
>>
>> In the 3a/4a case, the kworker is given no chance to set the
>> necessary flag. The flag can now only be set by another CPU.
>> The scheduling of the kworker was not caused by cond_resched, but
>> rather by the wakeup path of the scheduler.
>>
>> cond_resched only starts working once the load balancer (I suppose)
>> decides to migrate the vhost thread off to another CPU. In that case,
>> the load balancer on another CPU sets that flag and we are good.
>> That eventually allows the scheduler to pick the kworker, but very
>> late.
> 
> Are we going anywhere with this btw?
> 
>

I think Tobias had a couple of other threads related to this, with other potential fixes:

https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@xxxxxxxxxxxxx/

https://lore.kernel.org/lkml/20240228161023.14310-1-huschle@xxxxxxxxxxxxx/
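For reference, the two mitigations summarized at the top of the quoted mail
would look roughly like the sketch below. This is only an illustration of the
idea, not the actual vhost code; vhost_transfer_initiated() is a made-up
placeholder for "a data transfer was successfully initiated":

#include <linux/sched.h>
#include <linux/types.h>

/* Made-up placeholder for "a data transfer was successfully initiated". */
static bool vhost_transfer_initiated(void)
{
	return false;
}

/*
 * Option 1: replace cond_resched() with a hard schedule, so the CPU is
 * given up unconditionally and the woken kworker can run even if EEVDF
 * considers it ineligible (negative lag).
 */
static void vhost_worker_sketch_hard_schedule(void)
{
	while (vhost_transfer_initiated())
		schedule();
}

/*
 * Option 2: keep cond_resched(), but set the need_resched flag on the
 * vhost task itself after initiating a transfer.  cond_resched() is a
 * no-op unless TIF_NEED_RESCHED is set, and in the 3a/4a case above
 * nothing on this CPU ever sets it.
 */
static void vhost_worker_sketch_set_flag(void)
{
	while (vhost_transfer_initiated()) {
		set_tsk_need_resched(current);
		cond_resched();
	}
}

Either way, the open question from the quote remains: whether doing this
causes problems with other workloads/benchmarks.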

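The "negative lag" part of the quoted flow comes down to the eligibility
check that 86bfbb7ce4f6 introduces: a woken entity that has already run
ahead of the queue's weighted average vruntime is enqueued, but not picked
over the currently running task. A heavily simplified sketch of that idea
(names and math simplified, not the actual fair.c code):

#include <linux/types.h>

/*
 * Grossly simplified: treat lag as "queue average vruntime minus the
 * entity's vruntime".  A kworker that was preempted while ahead of its
 * fair share keeps its negative lag across sleep/wakeup, so on wakeup
 * it is runnable but not eligible, and the still running vhost task
 * keeps the CPU (case 3a/4a above).
 */
static bool eligible_sketch(s64 avg_vruntime, s64 se_vruntime)
{
	s64 lag = avg_vruntime - se_vruntime;

	return lag >= 0;	/* negative lag => not eligible yet */
}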



