On Wed, 28 Apr 2021 at 11:51, Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx> wrote: > > > > > -----Original Message----- > > From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx] > > Sent: Tuesday, April 27, 2021 11:36 PM > > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; > > tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx; > > rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; bp@xxxxxxxxx; > > tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; > > rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx; mgorman@xxxxxxx > > Cc: msys.mizuma@xxxxxxxxx; valentin.schneider@xxxxxxx; > > gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; > > juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx; > > aubrey.li@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; > > linux-kernel@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; > > xuwei (O) <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; > > guodong.xu@xxxxxxxxxx; yangyicong <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth) > > <liguozhu@xxxxxxxxxxxxx>; linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx > > Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks > > within one LLC > > > > On 20/04/2021 02:18, Barry Song wrote: > > > > [...] > > > > > @@ -5786,11 +5786,12 @@ static void record_wakee(struct task_struct *p) > > > * whatever is irrelevant, spread criteria is apparent partner count exceeds > > > * socket size. > > > */ > > > -static int wake_wide(struct task_struct *p) > > > +static int wake_wide(struct task_struct *p, int cluster) > > > { > > > unsigned int master = current->wakee_flips; > > > unsigned int slave = p->wakee_flips; > > > - int factor = __this_cpu_read(sd_llc_size); > > > + int factor = cluster ? __this_cpu_read(sd_cluster_size) : > > > + __this_cpu_read(sd_llc_size); > > > > I don't see that the wake_wide() change has any effect here. None of the > > sched domains has SD_BALANCE_WAKE set so a wakeup (WF_TTWU) can never > > end up in the slow path. > > I am really confused. The whole code has only checked if wake_flags > has WF_TTWU, it has never checked if sd_domain has SD_BALANCE_WAKE flag. look at : #define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */ so when wake_wide return false, we use the wake_affine mecanism but if it's false then we fllback to default mode which looks for: if (tmp->flags & sd_flag) This means looking for SD_BALANCE_WAKE which is never set so sd will stay NULL and you will end up calling select_idle_sibling anyway > > static int > select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) > { > ... > > if (wake_flags & WF_TTWU) { > record_wakee(p); > > if (sched_energy_enabled()) { > new_cpu = find_energy_efficient_cpu(p, prev_cpu); > if (new_cpu >= 0) > return new_cpu; > new_cpu = prev_cpu; > } > > want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); > } > } > > And try_to_wake_up() has always set WF_TTWU: > static int > try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > { > cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU); > ... > } > > So the change in wake_wide will actually affect the value of want_affine. > And I did also see code entered slow path during my benchmark. > > One issue I mentioned during linaro open discussion is that > since I have moved to use cluster size to decide the value > of wake_wide, relatively less tasks will make wake_wide() > decide to go to slow path, thus, tasks begin to spread to > other NUMA, but actually llc_size might be able to contain > those tasks. So a possible model might be: > static int wake_wide(struct task_struct *p) > { > tasksize < cluster : scan cluster > tasksize > llc : slow path > tasksize > cluster && tasksize < llc: scan llc > } > > thoughts? > > > Have you seen a diff when running your `lmbench stream` workload in what > > wake_wide() returns when you use `sd cluster size` instead of `sd llc > > size` as factor? > > > > I guess for you, wakeups are now subdivided into faster (cluster = 4 > > CPUs) and fast (llc = 24 CPUs) via sis(), not into fast (sis()) and slow > > (find_idlest_cpu()). > > > > > > > > if (master < slave) > > > swap(master, slave); > > > > [...] > > > > > @@ -6745,6 +6748,12 @@ static int find_energy_efficient_cpu(struct > > task_struct *p, int prev_cpu) > > > int want_affine = 0; > > > /* SD_flags and WF_flags share the first nibble */ > > > int sd_flag = wake_flags & 0xF; > > > + /* > > > + * if cpu and prev_cpu share LLC, consider cluster sibling rather > > > + * than llc. this is typically true while tasks are bound within > > > + * one numa > > > + */ > > > + int cluster = sched_cluster_active() && cpus_share_cache(cpu, prev_cpu, 0); > > > > So you changed from scanning cluster before LLC to scan either cluster > > or LLC. > > Yes, I have seen two ugly things for scanning cluster before scanning LLC > in select_idle_cpu. > 1. avg_scan_cost is actually measuring the scanning time. But if we scan > cluster before scanning LLC, during the gap of these two different > domains, we need a huge bit operation and this bit operation is not > a scanning operation at all. This makes the scan_cost quite > untrustworthy particularly "nr" can sometimes be < cluster size, sometimes > > cluster size. > > 2. select_idle_cpu() is actually the last step of wake_affine, before > that, wake_affine code has been totally depending on cpus_share_cache() > to decide the target to scan from. When waker and wakee have been already > in one LLC, if we only change select_idle_cpu(), at that time, decision > has been made. we may lose some chance to choose the right target to scan > from. So it should be more sensible to let cpus_share_cache() check cluster > when related tasks have been in one same LLC. > > > > > And this is based on whether `this_cpu` and `prev_cpu` are sharing LLC > > or not. So you only see an effect when running the workload with > > `numactl -N X ...`. > > Ideally, I'd expect this can also positively affect tasks located in > different LLCs. > For example, if taskA and taskB are in different NUMAs(also LLC for both > kunpeng920 and Tim's hardware) at the first beginning, a two-stage packing > might make them take the advantage of cluster: > For the first wake-up, taskA and taskB will be put in one LLC by scanning > LLC; > For the second wake-up, they might be put in one cluster by scanning > cluster. > So ideally, scanning LLC and scanning cluster can work in two stages > for related tasks and pack them step by step. Practically, this > could happen. But LB between NUMA might take the opposite way. Actually, > for a kernel completely *without* cluster patch, I have seen some > serious ping-pong of tasks in two numa nodes due to the conflict > of wake_affine and LB. this kind of ping-pong could seriously affect > the performance. > For example, for g=6,12,18,24,28,32, I have found running same workload > on 2numa shows so much worse latency than doing that on single one > numa(each numa has 24 cpus). > 1numa command: numactl -N 0 hackbench -p -T -l 1000000 -f 1 -g $1 > 2numa command: numactl -N 0-1 hackbench -p -T -l 1000000 -f 1 -g $1 > > Measured the average latency of 20 times for each command. > > *)without cluster scheduler, 2numa vs 1numa: > g = 6 12 18 24 28 32 > 1numa 1.2474 1.5635 1.5133 1.4796 1.6177 1.7898 > 2numa 4.1997 5.8172 6.0159 7.2343 6.8264 6.5419 > > BTW, my cluster patch actually also improves 2numa: > *)with cluster scheduler 2numa vs 1numa: > g = 6 12 18 24 28 32 > 1numa 0.9500 1.0728 1.1756 1.2201 1.4166 1.5464 > 2numa 3.5404 4.3744 4.3988 4.6994 5.3117 5.4908 > > *) 2numa w/ and w/o cluster: > g = 6 12 18 24 28 32 > 2numa w/o 4.1997 5.8172 6.0159 7.2343 6.8264 6.5419 > 2numa w/ 3.5404 4.3744 4.3988 4.6994 5.3117 5.4908 > > Ideally, load balance should try to pull unmarried tasks rather than > married tasks. I mean, if we have > groupA: task1+task2 as couple, task3 as bachelor > groupB: task4. > groupB should try to pull task3. But I feel it is extremely hard to let > LB understand who is married and who is unmarried. > > I assume 2numa worse than 1numa should be a different topic > which might be worth more investigation. > > On the other hand, use cases I'm now targeting at are really using > "numactl -N x" to bind processes in one NUMA. If we ignore other NUMA > (also other LLCs) and think one NUMA as a whole system, cluster would > be the last-level topology scheduler can use. And the code could be > quite clean to directly leverage the existing select_sibling code for > LLC by simply changing cpus_share_cache() to cluster level. > > > > > > > > > if (wake_flags & WF_TTWU) { > > > record_wakee(p); > > > @@ -6756,7 +6765,7 @@ static int find_energy_efficient_cpu(struct task_struct > > *p, int prev_cpu) > > > new_cpu = prev_cpu; > > > } > > > > > > - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); > > > + want_affine = !wake_wide(p, cluster) && cpumask_test_cpu(cpu, > > p->cpus_ptr); > > > } > > > > > > rcu_read_lock(); > > > @@ -6768,7 +6777,7 @@ static int find_energy_efficient_cpu(struct task_struct > > *p, int prev_cpu) > > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && > > > cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { > > > if (cpu != prev_cpu) > > > - new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync); > > > + new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, cluster); > > > > > > sd = NULL; /* Prefer wake_affine over balance flags */ > > > break; > > > @@ -6785,7 +6794,7 @@ static int find_energy_efficient_cpu(struct task_struct > > *p, int prev_cpu) > > > new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); > > > } else if (wake_flags & WF_TTWU) { /* XXX always ? */ > > > /* Fast path */ > > > - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); > > > + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, cluster); > > > > > > if (want_affine) > > > current->recent_used_cpu = cpu; > > > > [...] > > Thanks > Barry >