The virtual machine has a cgroup hierarchy as follows:

              root
               |
             vm_tg
            (cfs_rq)
            /      \
          (se)    (se)
          tg_A    tg_B
        (cfs_rq) (cfs_rq)
          /         \
        (se)       (se)
         a           b

'a' and 'b' are two vcpus of the VM.

We set a cfs quota on vm_tg, and the scheduling latency of the vcpus
(a/b) can become very large, up to more than 2s. We used perf sched to
capture the latency:

  perf sched record -a sleep 10; perf sched lat -p --sort=max

and the result is as follows:

  Task      | Runtime ms | Switches | Average delay ms | Maximum delay ms |
  --------------------------------------------------------------------------
  CPU 0/KVM | 260.261 ms |       50 | avg:   82.017 ms | max: 2510.990 ms |
  ...

We tested the latest kernel and the result is the same.

We added some tracepoints and found that the following sequence causes
the issue:

1) 'a' is the only task of tg_A. When 'a' goes to sleep (e.g. vcpu
   halt), tg_A is dequeued and tg_A->se->load.weight = MIN_SHARES.
2) 'b' continues running and then triggers the throttle, so
   tg_A->cfs_rq->throttle_count = 1.
3) Something wakes up 'a' (e.g. the vcpu receives a virq). When tg_A is
   enqueued, tg_A->se->load.weight cannot be updated because
   tg_A->cfs_rq->throttle_count = 1.
4) After one cfs quota period, vm_tg is unthrottled.
5) 'a' is running.
6) After one tick, when tg_A->se's vruntime is updated,
   tg_A->se->load.weight is still MIN_SHARES, so tg_A->se's vruntime
   grows by a large value.
7) That causes 'a' to have a large scheduling latency.

We *rudely* removed the check that prevents tg_A->se->load.weight from
being reweighted in step 3, as follows, and the problem disappears:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be..348ccd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3016,9 +3016,6 @@ static void update_cfs_group(struct sched_entity *se)
 	if (!gcfs_rq)
 		return;
 
-	if (throttled_hierarchy(gcfs_rq))
-		return;
-
 #ifndef CONFIG_SMP
 	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);

So do you guys have any suggestions on this problem? Is there a better
way to fix it?

-- 
Regards,
Longpeng(Mike)