On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

[...]

> > What are the weights of the two entities?
>

Both entities have the same weights (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

  CPU 6/KVM-2360 [011] d.... 1158.884473: sched_place: comm=vhost-2961 pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
  CPU 6/KVM-2360 [011] d.... 1158.884473: sched_place: comm= pid=0 sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
  CPU 6/KVM-2360 [011] d.... 1158.884474: sched_place: comm= pid=0 sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
  CPU 6/KVM-2360 [011] d.... 1158.884474: sched_place: comm= pid=0 sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
  CPU 6/KVM-2360 [011] d.... 1158.884474: sched_place: comm= pid=0 sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 cpu=11 nr=0 vru=76600751247 lag=0
<...>
 vhost-2961-2984 [011] d.... 1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:

  kworker     -> 1048576
  vhost       -> 1048576
  cgroup root ->    3876

kworker and vhost weights remain the same. The weights of the nodes in
the cgroup hierarchy vary.

I also gave this some more thought and have some more observations:

1. kworker lag after short runtime

 vhost-2961-2984 [011] d.... 1158.884486: sched_waking: comm=kworker/11:2 pid=202 prio=120 target_cpu=011
 vhost-2961-2984 [011] d.... 1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>                                                                                                                                                                                          ^^^^^
 vhost-2961-2984 [011] d.... 1158.884490: sched_switch: prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> next_comm=kworker/11:2 next_pid=202 next_prio=120
kworker/11:2-202 [011] d.... 1158.884491: sched_waking: comm=CPU 0/KVM pid=2988 prio=120 target_cpu=009
kworker/11:2-202 [011] d.... 1158.884492: sched_stat_runtime: comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=76603911111 [ns] deadline=76606905961 [ns] lag=76606905961
                                                                                                                                                             ^^^^^^^^^^^^^^^^
kworker/11:2-202 [011] d.... 1158.884492: sched_update: comm=kworker/11:2 pid=202 sev=76603911111 sed=76606905961 sel=-1128 sew=1048576 avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=10000000
                                                                                                                                                                                     ^^^^^^^^^
kworker/11:2-202 [011] d.... 1158.884494: sched_stat_wait: comm=vhost-2961 pid=2984 delay=5150 [ns]
kworker/11:2-202 [011] d.... 1158.884494: sched_switch: prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed
on the timeline with 0 lag. The kworker then executes for 5150ns and
returns control to the vhost. Unfortunately, this short runtime earns
the kworker a negative lag of -1128. This, in turn, causes the kworker
not to be selected by check_preempt_wakeup_fair.
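Just to double-check that arithmetic: as far as I can tell, the lag is
simply avg_vruntime minus the entity's vruntime, clamped to +/- the lim
value shown in the trace (my simplified reading of update_entity_lag()
in kernel/sched/fair.c, so take it with a grain of salt). A tiny
standalone sketch, fed with the values from the sched_update line above:

  /* Simplified model of how I read the lag update; not kernel code. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          int64_t avg_vruntime = 76603909983LL; /* avg= from sched_update */
          int64_t se_vruntime  = 76603911111LL; /* sev= after the 5150ns  */
          int64_t limit        = 10000000LL;    /* lim= from sched_update */

          int64_t lag = avg_vruntime - se_vruntime;
          if (lag > limit)
                  lag = limit;
          if (lag < -limit)
                  lag = -limit;

          printf("lag = %lld\n", (long long)lag); /* prints -1128 */
          return 0;
  }

This reproduces the -1128 seen above, so the negative lag appears to be
nothing more than the kworker's vruntime having advanced past
avg_vruntime during those 5150ns.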
My naive understanding of lag is that only entities which consume more
time than they should get negative lag. Why is the kworker being
punished for running only a tiny portion of time? In the majority of
cases, the kworker finishes after a 4-digit number of ns. There are
occasional outliers with 5-digit numbers. I would therefore not expect
negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was
not. The kworker gets put on the queue with no lag, so it essentially
has its vruntime set to avg_vruntime. After giving up its timeslice,
the kworker now has a vruntime which is larger than avg_vruntime. Hence
the negative lag might make sense here from an algorithmic standpoint.

2a/b. vhost getting increased deadlines over time, no call of pick_eevdf

 vhost-2961-2984 [011] d.h.. 1158.892878: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] deadline=3606948448 [ns] lag=3598548386
 vhost-2961-2984 [011] d.h.. 1158.892879: sched_stat_runtime: comm= pid=0 runtime=8385872 [ns] vruntime=76604158567 [ns] deadline=77624751246 [ns] lag=77624751246
<..>
 vhost-2961-2984 [011] d.h.. 1158.902877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999435 [ns] vruntime=3613947883 [ns] deadline=3616947883 [ns] lag=3598548386
 vhost-2961-2984 [011] d.h.. 1158.902878: sched_stat_runtime: comm= pid=0 runtime=9999435 [ns] vruntime=76633826282 [ns] deadline=78137144356 [ns] lag=77624751246
<..>
 vhost-2961-2984 [011] d.h.. 1158.912877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999824 [ns] vruntime=3623947707 [ns] deadline=3626947707 [ns] lag=3598548386
 vhost-2961-2984 [011] d.h.. 1158.912878: sched_stat_runtime: comm= pid=0 runtime=9999824 [ns] vruntime=76688003113 [ns] deadline=78161723086 [ns] lag=77624751246
<..>
<..>
 vhost-2961-2984 [011] dN... 1159.152927: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=40402 [ns] vruntime=3863988069 [ns] deadline=3866947667 [ns] lag=3598548386
 vhost-2961-2984 [011] dN... 1159.152928: sched_stat_runtime: comm= pid=0 runtime=40402 [ns] vruntime=78355923791 [ns] deadline=78393801472 [ns] lag=77624751246

In the sequence above, I extended the tracing of sched_stat_runtime to
use for_each_sched_entity to also output the values for the cgroup
hierarchy. The first entry represents the actual task, the second entry
represents the root for that particular cgroup. I dropped the levels in
between for readability.

The first three groupings happen in sequence. The fourth grouping is
the last sched_stat_runtime update before the vhost gets migrated off
the CPU. The groupings in between repeat the same pattern.

Interestingly, the vruntimes of the root grow faster than those of the
actual task. I assume this is intended. At the same time, the deadlines
keep on growing for the vhost and the cgroup root, while the kworker is
left starving with its negative lag.

At no point in this sequence is pick_eevdf called. The only time
pick_eevdf is called is right when the kworker is woken up. So
check_preempt_wakeup_fair seems to be the only chance for the kworker
to get scheduled in time. For reference:

 vhost-2961-2984 [011] d.... 1158.884563: sched_place: comm=kworker/11:2 pid=202 sev=76604163719 sed=76607163719 sel=-1128 sew=1048576 avg=76604158567 min=76604158567 cpu=11 nr=1 vru=76604158567 lag=-5152

The kworker has a deadline which is definitely smaller than the one of
the vhost in later stages. So, I would assume it should get scheduled
at some point.
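To spell out why that earlier deadline does not help at the wakeup
itself: as far as I understand pick_eevdf(), only entities that are
still eligible are considered, i.e. entities whose vruntime has not
passed avg_vruntime (roughly what entity_eligible() checks), and the
deadline only decides among those. A minimal sketch with the values
from the sched_place line above (again simplified, so take it with a
grain of salt):

  /* Simplified model of the eligibility check gating pick_eevdf(). */
  #include <stdio.h>
  #include <stdint.h>

  static int eligible(int64_t vruntime, int64_t avg_vruntime)
  {
          return vruntime <= avg_vruntime;
  }

  int main(void)
  {
          int64_t avg_vruntime     = 76604158567LL; /* avg= at the wakeup */
          int64_t kworker_vruntime = 76604163719LL; /* sev= (negative lag) */
          int64_t kworker_deadline = 76607163719LL; /* sed= */

          /*
           * The deadline is far earlier than the ~77624751246 seen for
           * the vhost cgroup root above, but that only matters among
           * eligible entities.
           */
          printf("kworker eligible at wakeup: %d (deadline %lld)\n",
                 eligible(kworker_vruntime, avg_vruntime),
                 (long long)kworker_deadline); /* prints 0 */
          return 0;
  }

So, despite its earlier deadline, the kworker is not eligible at the
wakeup and has to wait for one of the later picks, which only happen at
an actual scheduling point.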
If vhost is running in kernel space and is therefore not preemptible,
this would be expected behavior though.

3. vhost looping endlessly, waiting for kworker to be scheduled

I dug a little deeper into what the vhost is doing. I'm not an expert
on virtio whatsoever, so these are just educated guesses that maybe
someone can verify/correct. Please bear with me probably messing up the
terminology.

- vhost is looping through the available queues.
- vhost wants to wake up a kworker to process a found queue.
- kworker does something with that queue and terminates quickly.

What I found by throwing in some very noisy trace statements was that,
if the kworker is not woken up, the vhost just keeps looping across all
available queues (and seems to repeat itself). So it essentially relies
on the scheduler to schedule the kworker fast enough. Otherwise it will
just keep on looping until it is migrated off the CPU.

SUMMARY

1 and 2a/b have some more or less plausible potential explanations,
where the EEVDF scheduler might just be doing what it is designed to do.

3 is trickier since I'm not familiar with the topic. If the vhost just
relies on the kworker preempting it, then this sounds a bit
counter-intuitive. But there might also be a valid design decision
behind this.

If 1 and 2 are indeed plausible, path 3 is probably the one to pursue
in order to figure out whether we have a problem there.

[...]