Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

Tobias Huschle <huschle@xxxxxxxxxxxxx> · Mon, 27 Nov 2023 14:56:18 +0100

On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
> 
> The below should also work for internal entities, but last time I poked
> around with it I had some regressions elsewhere -- you know, fix one,
> wreck another type of situations on hand.
> 
> But still, could you please give it a go -- it applies cleanly to linus'
> master and -rc2.
> 
> ---
> Subject: sched/eevdf: Revenge of the Sith^WSleepers
> 

Tried the patch, it does not help unfortuntately.

It might also be possible that the long running vhost is stuck on something.
During those phases where the vhost just runs for a while. This might have
been a risk for a while, EEVDF might have just uncovered an unfortuntate
sequence of events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the 
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:       
                    vruntime      deadline
   cgroup        56117619190   57650477291 -> depth 0
   kworker       56117624131   56120619190
  This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller 
  by a factor of 20, but depth 0 seems to be running with values of the 
  correct magnitude.
- cgroup at depth 0 has zero lag, with higher depth, there are large lag 
  values (as observed 606.338267 onwards)

Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:       
                    vruntime      deadline
   cgroup        56117619190   57650477291 -> depth 0, last known value
   kworker       56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here 
  I would need to enhance the trace to also print the vruntimes of the 
  parent sched_entities to see the progress of their vruntime/deadline/lag 
  values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker 
  over the cgroup as its deadline is smaller.
  But, the kworker has negative lag, which might cause EEVDF to not pick 
  the kworker.
  The cgroup at depth 0 has no lag, all deeper layers have a significantly 
  positve lag (last known values, might have changed in the meantime).
  At this point I would see the option that the vhost task is stuck
  somewhere or EEVDF just does not see the kworker as a eligible option.

- Once the vhost is migrated off the cpu, the update_entity_lag function
  works with the following values at 606.467022: sched_update
  For the cgroup at depth 0
  - vruntime = 57104166665 --> this is in line with the amount of new timeslices
                               vhost got assigned while the kworker was waiting
  - vlag     =   -62439022 --> the scheduler knows that the cgroup was 
                               overconsuming, but no runtime for the kworker
  For the cfs_rq we have
  - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
  - avg_vruntime = 161750065796 --> this is rather large in comparison, but I 
                                    might access this value at a bad time
  - nr_running   =            2 --> at this point, both, cgroup and kworker are 
                                    still on the queue, with the cgroup being 
                                    in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not 
    propageted downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
  onto cpu 13 with a lag of -166821875 it gets scheduled right away as 
  there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
  with no lag and get scheduled almost immediately, with a wait 
  time of 1255 ns.

It shall be noted, that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

######################### full trace #########################

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> during __pick_eevdf, prints values for best and the first node loop variable, second loop is never executed

sched_place/sched_update: sev=se->vruntime,sed=se->deadline,sev=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
--> at the end of place_entity and update_entity_lag

--> the chunks of 5 entries for these new events represent the 5 levels of the cgroup which hosts the vhost

    vhost-2931-2953    [013] d....   606.338262: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=90133345 [ns]
    vhost-2931-2953    [013] d....   606.338262: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
    vhost-2931-2953    [013] dN...   606.338263: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] dN...   606.338263: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
    vhost-2931-2953    [013] dN...   606.338263: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=17910 [ns] vruntime=2099190650 [ns] deadline=2102172740 [ns] lag=2102172740
    vhost-2931-2953    [013] dN...   606.338264: sched_stat_wait: comm=kworker/13:1 pid=168 delay=0 [ns]
    vhost-2931-2953    [013] d....   606.338264: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
--> kworker allowed to execute
  kworker/13:1-168     [013] d....   606.338266: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
  kworker/13:1-168     [013] d....   606.338267: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=4941 [ns] vruntime=56117624131 [ns] deadline=56120619190 [ns] lag=56120619190
--> runtime of 4941 ns
  kworker/13:1-168     [013] d....   606.338267: sched_update: comm=kworker/13:1 pid=168 sev=56117624131 sed=56120619190 sel=-725 avg=0 min=56117619190 cpu=13 nr=2 lag=-725 lim=10000000
  kworker/13:1-168     [013] d....   606.338267: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117619190 d=57650477291 l=0 md=57650477291 dp=0
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56117619190   57650477291
                        kworker       56117624131   56120619190
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=29822481776 d=29834647752 l=29834647752 md=29834647752 dp=1
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=21909608438 d=21919458955 l=21919458955 md=21919458955 dp=2
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=11306038504 d=11312426915 l=11312426915 md=11312426915 dp=3
  kworker/13:1-168     [013] d....   606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2099190650 d=2102172740 l=2102172740 md=2102172740 dp=4
  kworker/13:1-168     [013] d....   606.338268: sched_stat_wait: comm=vhost-2931 pid=2953 delay=4941 [ns]
  kworker/13:1-168     [013] d....   606.338269: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
    vhost-2931-2953    [013] d....   606.338311: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.338312: sched_place: comm=kworker/13:1 pid=168 sev=56117885776 sed=56120885776 sel=-725 avg=0 min=56117880833 cpu=13 nr=1 vru=56117880833 lag=-725
--> kworker gets placed again
    vhost-2931-2953    [013] d....   606.338312: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=44970 [ns]
    vhost-2931-2953    [013] d....   606.338313: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
--> kworker set to runnable, but vhost keeps on executing
    vhost-2931-2953    [013] d.h..   606.346964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=8697702 [ns] vruntime=2107888352 [ns] deadline=2110888352 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.356964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999583 [ns] vruntime=2117887935 [ns] deadline=2120887935 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.366964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000089 [ns] vruntime=2127888024 [ns] deadline=2130888024 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.376964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999716 [ns] vruntime=2137887740 [ns] deadline=2140887740 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.386964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000179 [ns] vruntime=2147887919 [ns] deadline=2150887919 [ns] lag=2102172740
    vhost-2931-2953    [013] D....   606.392250: sched_waking: comm=vhost-2306 pid=2324 prio=120 target_cpu=018
    vhost-2931-2953    [013] D....   606.392388: sched_waking: comm=vhost-2306 pid=2321 prio=120 target_cpu=017
    vhost-2931-2953    [013] D....   606.392390: sched_migrate_task: comm=vhost-2306 pid=2321 prio=120 orig_cpu=17 dest_cpu=23
    vhost-2931-2953    [013] d.h..   606.396964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000187 [ns] vruntime=2157888106 [ns] deadline=2160888106 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.406964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000112 [ns] vruntime=2167888218 [ns] deadline=2170888218 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.416964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999779 [ns] vruntime=2177887997 [ns] deadline=2180887997 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.426964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999667 [ns] vruntime=2187887664 [ns] deadline=2190887664 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.436964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000329 [ns] vruntime=2197887993 [ns] deadline=2200887993 [ns] lag=2102172740
    vhost-2931-2953    [013] D....   606.441980: sched_waking: comm=vhost-2306 pid=2325 prio=120 target_cpu=021
    vhost-2931-2953    [013] d.h..   606.446964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000069 [ns] vruntime=2207888062 [ns] deadline=2210888062 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.456964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999977 [ns] vruntime=2217888039 [ns] deadline=2220888039 [ns] lag=2102172740
    vhost-2931-2953    [013] d.h..   606.466964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999548 [ns] vruntime=2227887587 [ns] deadline=2230887587 [ns] lag=2102172740
    vhost-2931-2953    [013] dNh..   606.466979: sched_wakeup: comm=migration/13 pid=80 prio=0 target_cpu=013
    vhost-2931-2953    [013] dN...   606.467017: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=41352 [ns] vruntime=2227928939 [ns] deadline=2230887587 [ns] lag=2102172740
    vhost-2931-2953    [013] d....   606.467018: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=migration/13 next_pid=80 next_prio=0
  migration/13-80      [013] d..1.   606.467020: sched_update: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230887587 sel=0 avg=0 min=2227928939 cpu=13 nr=1 lag=0 lim=10000000
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=12075393889 sed=12087868931 sel=0 avg=0 min=12075393889 cpu=13 nr=1 lag=0 lim=42139916
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=23017543001 sed=23036322254 sel=0 avg=0 min=23017543001 cpu=13 nr=1 lag=0 lim=63209874
  migration/13-80      [013] d..1.   606.467021: sched_update: comm= pid=0 sev=30619368612 sed=30633124735 sel=0 avg=0 min=30619368612 cpu=13 nr=1 lag=0 lim=46126124
  migration/13-80      [013] d..1.   606.467022: sched_update: comm= pid=0 sev=57104166665 sed=57945071818 sel=-62439022 avg=161750065796 min=56117885776 cpu=13 nr=2 lag=-62439022 lim=62439022
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        57104166665   57945071818
                        kworker       56117885776   56120885776  --> last known values
--> cgroup's lag of -62439022 indicates that the scheduler knows that the cgroup ran for too long
--> nr=2 shows that the cgroup and the kworker are currently on the runqueue
  migration/13-80      [013] d..1.   606.467022: sched_migrate_task: comm=vhost-2931 pid=2953 prio=120 orig_cpu=13 dest_cpu=12
  migration/13-80      [013] d..1.   606.467023: sched_place: comm=vhost-2931 pid=2953 sev=2994881412 sed=2997881412 sel=0 avg=0 min=2994881412 cpu=12 nr=0 vru=2994881412 lag=0
  migration/13-80      [013] d..1.   606.467023: sched_place: comm= pid=0 sev=16617220304 sed=16632657489 sel=0 avg=0 min=16617220304 cpu=12 nr=0 vru=16617220304 lag=0
  migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=30778525102 sed=30804781512 sel=0 avg=0 min=30778525102 cpu=12 nr=0 vru=30778525102 lag=0
  migration/13-80      [013] d..1.   606.467024: sched_place: comm= pid=0 sev=38704326194 sed=38724404624 sel=0 avg=0 min=38704326194 cpu=12 nr=0 vru=38704326194 lag=0
  migration/13-80      [013] d..1.   606.467025: sched_place: comm= pid=0 sev=66383057731 sed=66409091628 sel=-30739032 avg=0 min=66383057731 cpu=12 nr=0 vru=66383057731 lag=0
--> vhost migrated off to CPU 12
  migration/13-80      [013] d....   606.467026: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56117885776 d=56120885776 l=-725 md=56120885776 dp=0
  migration/13-80      [013] d....   606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]
  migration/13-80      [013] d....   606.467027: sched_switch: prev_comm=migration/13 prev_pid=80 prev_prio=0 prev_state=S ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
--> kworker runs next
  kworker/13:1-168     [013] d....   606.467030: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
  kworker/13:1-168     [013] d....   606.467032: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=6163 [ns] vruntime=56117891939 [ns] deadline=56120885776 [ns] lag=56120885776
  kworker/13:1-168     [013] d....   606.467032: sched_update: comm=kworker/13:1 pid=168 sev=56117891939 sed=56120885776 sel=0 avg=0 min=56117891939 cpu=13 nr=1 lag=0 lim=10000000
  kworker/13:1-168     [013] d....   606.467033: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=swapper/13 next_pid=0 next_prio=120
--> kworker finishes
        <idle>-0       [013] d.h..   606.479977: sched_place: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230928939 sel=0 avg=0 min=2227928939 cpu=13 nr=0 vru=2227928939 lag=0
--> vhost migrated back and placed on CPU 13 again
        <idle>-0       [013] d.h..   606.479977: sched_stat_sleep: comm=vhost-2931 pid=2953 delay=27874 [ns]
        <idle>-0       [013] d.h..   606.479977: sched_place: comm= pid=0 sev=12075393889 sed=12099393888 sel=0 avg=0 min=12075393889 cpu=13 nr=0 vru=12075393889 lag=0
        <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=23017543001 sed=23056927616 sel=0 avg=0 min=23017543001 cpu=13 nr=0 vru=23017543001 lag=0
        <idle>-0       [013] d.h..   606.479978: sched_place: comm= pid=0 sev=30619368612 sed=30648907073 sel=0 avg=0 min=30619368612 cpu=13 nr=0 vru=30619368612 lag=0
        <idle>-0       [013] d.h..   606.479979: sched_place: comm= pid=0 sev=56117891939 sed=56168252594 sel=-166821875 avg=0 min=56117891939 cpu=13 nr=0 vru=56117891939 lag=0
        <idle>-0       [013] dNh..   606.479979: sched_wakeup: comm=vhost-2931 pid=2953 prio=120 target_cpu=013
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117891939 d=56168252594 l=-166821875 md=56168252594 dp=0
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56117891939   56168252594
                        kworker       56117891939   56120885776
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=30619368612 d=30648907073 l=0 md=30648907073 dp=1
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=23017543001 d=23056927616 l=0 md=23056927616 dp=2
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=12075393889 d=12099393888 l=0 md=12099393888 dp=3
        <idle>-0       [013] dN...   606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2227928939 d=2230928939 l=0 md=2230928939 dp=4
        <idle>-0       [013] dN...   606.479982: sched_stat_wait: comm=vhost-2931 pid=2953 delay=0 [ns]
        <idle>-0       [013] d....   606.479982: sched_switch: prev_comm=swapper/13 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
--> vhost can continue to bully the kworker
    vhost-2931-2953    [013] d....   606.479995: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.479996: sched_place: comm=kworker/13:1 pid=168 sev=56118220659 sed=56121220659 sel=0 avg=0 min=56118220659 cpu=13 nr=1 vru=56118220659 lag=0
    vhost-2931-2953    [013] d....   606.479996: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=12964004 [ns]
    vhost-2931-2953    [013] d....   606.479997: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
    vhost-2931-2953    [013] d....   606.479997: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=20837 [ns] vruntime=2227949776 [ns] deadline=2230928939 [ns] lag=2230928939
    vhost-2931-2953    [013] d....   606.479997: sched_update: comm=vhost-2931 pid=2953 sev=2227949776 sed=2230928939 sel=0 avg=0 min=2227949776 cpu=13 nr=1 lag=0 lim=10000000
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=12075560584 sed=12099393888 sel=0 avg=0 min=12075560584 cpu=13 nr=1 lag=0 lim=79999997
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=23017816553 sed=23056927616 sel=0 avg=0 min=23017816553 cpu=13 nr=1 lag=0 lim=131282050
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=30619573776 sed=30648907073 sel=0 avg=0 min=30619573776 cpu=13 nr=1 lag=0 lim=98461537
    vhost-2931-2953    [013] d....   606.479998: sched_update: comm= pid=0 sev=56118241726 sed=56168252594 sel=-19883 avg=0 min=56118220659 cpu=13 nr=2 lag=-19883 lim=167868850
    vhost-2931-2953    [013] d....   606.479999: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56118220659 d=56121220659 l=0 md=56121220659 dp=0
    vhost-2931-2953    [013] d....   606.479999: sched_stat_wait: comm=kworker/13:1 pid=168 delay=1255 [ns]
--> good delay of 1255 ns for the kworker
--> depth 0 of cgroup holding vhost:     vruntime      deadline
                        cgroup        56118241726   56168252594
                        kworker       56118220659   56121220659