> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Monday, May 3, 2021 6:12 PM
> To: 'Dietmar Eggemann' <dietmar.eggemann@xxxxxxx>; Vincent Guittot
> <vincent.guittot@xxxxxxxxxx>
> Cc: tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> rjw@xxxxxxxxxxxxx; bp@xxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx;
> lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; rostedt@xxxxxxxxxxx;
> bsegall@xxxxxxxxxx; mgorman@xxxxxxx; msys.mizuma@xxxxxxxxx;
> valentin.schneider@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron
> <jonathan.cameron@xxxxxxxxxx>; juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx;
> sudeep.holla@xxxxxxx; aubrey.li@xxxxxxxxxxxxxxx;
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>;
> Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; guodong.xu@xxxxxxxxxx; yangyicong
> <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth) <liguozhu@xxxxxxxxxxxxx>;
> linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx
> Subject: RE: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> within one LLC
>
> > -----Original Message-----
> > From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> > Sent: Friday, April 30, 2021 10:43 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx>
> > Cc: tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> > rjw@xxxxxxxxxxxxx; bp@xxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx;
> > lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; rostedt@xxxxxxxxxxx;
> > bsegall@xxxxxxxxxx; mgorman@xxxxxxx; msys.mizuma@xxxxxxxxx;
> > valentin.schneider@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron
> > <jonathan.cameron@xxxxxxxxxx>; juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx;
> > sudeep.holla@xxxxxxx; aubrey.li@xxxxxxxxxxxxxxx;
> > linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>;
> > Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; guodong.xu@xxxxxxxxxx; yangyicong
> > <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth) <liguozhu@xxxxxxxxxxxxx>;
> > linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx
> > Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> > within one LLC
> >
> > On 29/04/2021 00:41, Song Bao Hua (Barry Song) wrote:
> > >
> > >> -----Original Message-----
> > >> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> >
> > [...]
> >
> > >>>>> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> > >>
> > >> [...]
> > >>
> > >>>>> On 20/04/2021 02:18, Barry Song wrote:
> >
> > [...]
> >
> > > Though we will never go to the slow path, wake_wide() will affect
> > > want_affine, and so eventually affect the "new_cpu"?
> >
> > yes.
> >
> > > for_each_domain(cpu, tmp) {
> > >         /*
> > >          * If both 'cpu' and 'prev_cpu' are part of this domain,
> > >          * cpu is a valid SD_WAKE_AFFINE target.
> > >          */
> > >         if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > >             cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > >                 if (cpu != prev_cpu)
> > >                         new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> > >
> > >                 sd = NULL; /* Prefer wake_affine over balance flags */
> > >                 break;
> > >         }
> > >
> > >         if (tmp->flags & sd_flag)
> > >                 sd = tmp;
> > >         else if (!want_affine)
> > >                 break;
> > > }
> > >
> > > If wake_affine is false, the above won't execute, so new_cpu (the target)
> > > will always be "prev_cpu"? When task size > cluster size in wake_wide(),
> > > this means we won't pull the wakee to the cluster of the waker? It seems
> > > sensible.
> >
> > What is `task size` here?
> >
> > The criterion is `!(slave < factor || master < slave * factor)` or
> > `slave >= factor && master >= slave * factor` to wake wide.
>
> Yes. By "task size" I actually mean a bundle of waker-wakee tasks which
> can make "slave >= factor && master >= slave * factor" either true or
> false, and thus change the target cpu we are going to scan from.
> Now that I have moved the scan to cluster level while tasks stay within
> one LLC, it seems more sensible to use "cluster_size" as the factor?
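
To make the "factor" above concrete, mainline wake_wide() in this time frame
looks roughly like the sketch below (kernel/sched/fair.c, trimmed); the only
question in this thread is whether "factor" should stay sd_llc_size or become
the cluster span. select_task_rq_fair() then computes
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr),
which is the want_affine tested in the loop quoted above.

static int wake_wide(struct task_struct *p)
{
        unsigned int master = current->wakee_flips;
        unsigned int slave = p->wakee_flips;
        /* today: LLC span; the question here is whether to use the cluster span */
        int factor = __this_cpu_read(sd_llc_size);

        if (master < slave)
                swap(master, slave);
        /* stay wake-affine unless slave >= factor && master >= slave * factor */
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}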
> > I see that since you effectively change the sched domain size from LLC
> > to CLUSTER (e.g. 24->6) for wakeups with cpu and prev_cpu sharing LLC
> > (hence the `numactl -N 0` in your workload), wake_wide() has to take
> > CLUSTER size into consideration.
> >
> > I was wondering if you saw wake_wide() returning 1 with your use cases:
> >
> > numactl -N 0 /usr/lib/lmbench/bin/stream -P [6,12] -M 1024M -N 5
>
> I couldn't make wake_wide() return 1 with the above stream command.
> And I can't reproduce it with a 1:1 (monogamous) hackbench "-f 1".
>
> But I am able to reproduce this issue with an M:N hackbench, for example:
>
> numactl -N 0 hackbench -p -T -f 10 -l 20000 -g 1
>
> hackbench will create 10 senders which will send messages to 10
> receivers. (Each sender can send messages to all 10 receivers.)
>
> I've often seen flips like:
>
> waker   wakee
> 1501    39
> 1509    17
> 11      1320
> 13      2016
>
> 11, 13 and 17 are smaller than the LLC size but larger than the cluster
> size. So wake_wide() using the cluster factor will return 1; on the
> other hand, if we always use llc_size as the factor, it will return 0.
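
For reference on the "flips" above: they are the per-task wakee_flips
counters, which mainline maintains roughly as below (kernel/sched/fair.c).
The counter is bumped whenever a task wakes a different wakee than last
time and decays by half about once a second, so a large value means the
task has recently been spreading wakeups across many partners.

static void record_wakee(struct task_struct *p)
{
        /*
         * Only decay a single time; tasks that have less than 1 wakeup per
         * jiffy will not have built up many flips.
         */
        if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
                current->wakee_flips >>= 1;
                current->wakee_flip_decay_ts = jiffies;
        }

        if (current->last_wakee != p) {
                current->last_wakee = p;
                current->wakee_flips++;
        }
}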
> However, it seems the change in wake_wide() could bring some negative
> influence to the M:N relationship (-f 10) according to tests made today
> with:
>
> numactl -N 0 hackbench -p -T -f 10 -l 20000 -g $1
>
> g            =        1       2       3       4
> cluster_size      0.5768  0.6578  0.8117  1.0119
> LLC_size          0.5479  0.6162  0.6922  0.7754
>
> Always using llc_size as the factor in wake_wide() still shows better
> results in the 10:10 polygamous hackbench.
>
> So it seems `slave >= factor && master >= slave * factor` isn't
> a suitable criterion for cluster size?

On the other hand, according to "sched: Implement smarter wake-affine logic":
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62470419

a proper factor in wake_wide() is mainly beneficial for 1:n tasks like
postgresql/pgbench. So using the smaller cluster size as the factor might
help make want_affine false and thus improve pgbench.

From the commit log, the commit made the biggest improvement when
clients = 2 * cpus. In my case, that should be clients = 48 for a machine
whose LLC size is 24.

On Linux, I created a 240MB database and ran "pgbench -c 48 -S -T 20
pgbench" under two different scenarios:
1. page cache always hit, so no real I/O for database reads
2. echo 3 > /proc/sys/vm/drop_caches

For case 1, using cluster_size and using llc_size result in a similar
tps of ~108000, and all 24 cpus are at 100% cpu utilization.

For case 2, using llc_size still shows better performance.

tps for each test round (cluster size as factor in wake_wide):
1398.450887 1275.020401 1632.542437 1412.241627 1611.095692 1381.354294
1539.877146
avg tps = 1464

tps for each test round (llc size as factor in wake_wide):
1718.402983 1443.169823 1502.353823 1607.415861 1597.396924 1745.651814
1876.802168
avg tps = 1641 (+12%)

So it seems using cluster_size as the factor in "slave >= factor &&
master >= slave * factor" isn't a good choice for my machine, at least.

Thanks
Barry