> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> Sent: Wednesday, January 13, 2021 1:53 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; Morten Rasmussen
> <morten.rasmussen@xxxxxxx>; Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Cc: valentin.schneider@xxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; lenb@xxxxxxxxxx;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; juri.lelli@xxxxxxxxxx;
> rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx; mgorman@xxxxxxx;
> mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx; aubrey.li@xxxxxxxxxxxxxxx;
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-acpi@xxxxxxxxxxxxxxx; linuxarm@xxxxxxxxxxxxx; xuwei (O)
> <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; tiantao (H)
> <tiantao6@xxxxxxxxxxxxx>
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> and add cluster scheduler
>
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -----Original Message-----
> >> From: Morten Rasmussen [mailto:morten.rasmussen@xxxxxxx]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> To: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> >> Cc: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> >> valentin.schneider@xxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> >> rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; lenb@xxxxxxxxxx;
> >> gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> >> mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; juri.lelli@xxxxxxxxxx;
> >> dietmar.eggemann@xxxxxxx; rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx;
> >> mgorman@xxxxxxx; mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx;
> >> aubrey.li@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> >> linux-kernel@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx;
> >> linuxarm@xxxxxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>; Zengtao (B)
> >> <prime.zeng@xxxxxxxxxxxxx>; tiantao (H) <tiantao6@xxxxxxxxxxxxx>
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> >> and add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>>> has a local L3 tag. In addition, each cluster shares some internal system
> >>>> bus. This means cache is much more affine inside one cluster than across
> >>>> clusters.
> >>>
> >>> There is a similar need for clustering in x86. Some x86 cores share L2
> >>> caches in a way that is similar to the cluster in Kunpeng 920 (e.g. on
> >>> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> >>> separate L2, and 24 cores sharing L3). Having a sched domain at the L2
> >>> cluster level helps spread load among the L2 domains. This reduces L2
> >>> cache contention and helps performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches, while Barry is after consolidating tasks within the
> >> boundaries of an L3 tag cache. One helps cache utilization, the other
> >> communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > We are both actually looking for the same behavior. My patch achieves
> > the exact same spreading behavior as Tim's patch.
>
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
>
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I put a note on the 1st case: "Case 1. we have two tasks *without*
any relationship running in a system with 2 clusters and 8 cpus". So for
tasks without a wake-up relationship, the current patch only results in
spreading.

Anyway, I will also run Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER
to see what happens. So far the benchmarks have only covered the benefit of
changing the wake-up path; I would also be interested in figuring out what
we gain from the change in load_balance().
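Just to make the load-balance side concrete: the spreading in case 1 comes
purely from registering one more topology level below MC, roughly like the
sketch below. This is illustrative only; the helper and flag names
(cpu_clustergroup_mask, cpu_cluster_flags) follow the spirit of the RFC but
may not match the posted patches exactly:

/*
 * Sketch only: one extra level below MC describing the CPUs which share a
 * cluster (the L3 tag on Kunpeng 920, L2 on Jacobsville). Helper and flag
 * names here are placeholders, not necessarily what the RFC uses.
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

Once CLS exists as a separate sched domain, the regular load-balance path
spreads task1 and task2 across the two clusters as in the case-1 picture
below, without any wakeup-path change being involved.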
>
> It looks like Tim's workload (SPECrate mcf) shows a performance boost
> solely because of the changes the additional MC_L2 SD introduces in load
> balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we have to
> carefully distinguish between packing vs. spreading in wakeup and
> load-balance here.
>
> > Considering the below two cases:
> > Case 1. we have two tasks without any relationship running in a system
> > with 2 clusters and 8 cpus.
> >
> > Without the sched_domain of cluster, these two tasks might be put as below:
> >
> > +-------------------+          +-----------------+
> > | +----+   +----+   |          |                 |
> > | |task|   |task|   |          |                 |
> > | |1   |   |2   |   |          |                 |
> > | +----+   +----+   |          |                 |
> > |                   |          |                 |
> > |      cluster1     |          |     cluster2    |
> > +-------------------+          +-----------------+
> >
> > With the sched_domain of cluster, load balance will spread them as below:
> >
> > +-------------------+          +-----------------+
> > | +----+            |          | +----+          |
> > | |task|            |          | |task|          |
> > | |1   |            |          | |2   |          |
> > | +----+            |          | +----+          |
> > |                   |          |                 |
> > |      cluster1     |          |     cluster2    |
> > +-------------------+          +-----------------+
> >
> > Then task1 and task2 get more cache and less cache contention, so they
> > will get better performance. That is what my original patch achieves, and
> > what Tim's patch is also doing: once we add a sched_domain, load balance
> > gets involved.
> >
> >
> > Case 2. we have 8 tasks, running in a system with 2 clusters and 8 cpus,
> > but they are working in 4 groups:
> > Task1 wakes up task4
> > Task2 wakes up task5
> > Task3 wakes up task6
> > Task7 wakes up task8
> >
> > With my change in select_idle_sibling(), the WAKE_AFFINE mechanism will
> > try to put task1 and 4, task2 and 5, task3 and 6, task7 and 8 in the same
> > cluster rather than putting all of them on a random one of the 8 cpus.
> > However, the 8 tasks are still spread among the 8 cpus, as load balance
> > is still working.
> >
> > +---------------------------+    +----------------------+
> > | +----+  +-----+           |    | +----+  +-----+      |
> > | |task|  |task |           |    | |task|  |task |      |
> > | |1   |  | 4   |           |    | |2   |  |5    |      |
> > | +----+  +-----+           |    | +----+  +-----+      |
> > |                           |    |                      |
> > |         cluster1          |    |       cluster2       |
> > |                           |    |                      |
> > |                           |    |                      |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > | |task |  | task |         |    | |task |  |task  |    |
> > | |3    |  |  6   |         |    | |7    |  |8     |    |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > +---------------------------+    +----------------------+
>
> Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
> show the benefit of your patch on your specific system (cluster-size =
> 4). IMHO, this extra infrastructure, especially in the wakeup path, should
> show benefits over a range of different benchmarks.
>
> > Let's consider the 3rd case, which would be more tricky:
> >
> > task1 and task2 have a close relationship and they are a waker-wakee
> > pair. With my current patch, select_idle_sibling() wants to put them in
> > one cluster, while load balance wants to put them in two clusters. Load
> > balance will win. Then maybe we need some mechanism similar to the one
> > adjusting NUMA imbalance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> > If we permit a light imbalance between clusters, select_idle_sibling()
> > will win, and task1 and task2 get better cache affinity.
>
> It would look weird to allow this kind of imbalance on CLS (MC_L2) and
> NUMA domains but not on the MC domain, for example.

Yes. I guess I actually meant permitting imbalance between the sched_groups
made by the child sched_cluster domains of the parent sched_mc domain:

           sched_mc domain
  +----------------------------------+
  |  +--------+         +----------+ |
  |  |sched_  |         |sched_    | |
  |  |group   |         |group     | |
  |  +--+-----+         +----+-----+ |
  |     |   allow small      |       |
  |     |   imbalance        |       |
  +----------------------------------+
        |                    |
        |                    |
        |                    |
        +                    +
   child domain:        child domain:
   sched_cluster        sched_cluster

For the sched_groups within one sched_cluster domain, we don't allow this
kind of imbalance.
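To make that less hand-wavy, what I have in mind is something in the spirit
of adjust_numa_imbalance(), but keyed on the child cluster domain instead of
SD_NUMA. The below is only a sketch; "SD_CLUSTER" and the threshold of 2 are
placeholders made up for illustration, not code from this RFC:

/*
 * Sketch only: mirror the SD_NUMA small-imbalance allowance for the
 * sched_groups built from a child cluster domain. SD_CLUSTER and the
 * threshold below are placeholders.
 */
static inline long adjust_cluster_imbalance(long imbalance, int src_nr_running,
                                            struct sched_domain *sd)
{
        /* Only relax balancing between groups made by a cluster child domain */
        if (!sd->child || !(sd->child->flags & SD_CLUSTER))
                return imbalance;

        /*
         * Tolerate one communicating pair staying inside a cluster, so the
         * packing done by select_idle_sibling() is not immediately undone
         * by load balance.
         */
        if (src_nr_running <= 2)
                return 0;

        return imbalance;
}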
Anyway, I would be happier to see this kind of imbalance allowed only when
we know for sure that two tasks in the cluster have a wake-up relationship.
Right now, SD_NUMA seems to simply allow this imbalance without any
knowledge of the relationship between the tasks causing it.

Thanks
Barry