> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> Sent: Friday, April 30, 2021 10:43 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; Vincent Guittot
> <vincent.guittot@xxxxxxxxxx>
> Cc: tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> rjw@xxxxxxxxxxxxx; bp@xxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx;
> lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; rostedt@xxxxxxxxxxx;
> bsegall@xxxxxxxxxx; mgorman@xxxxxxx; msys.mizuma@xxxxxxxxx;
> valentin.schneider@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron
> <jonathan.cameron@xxxxxxxxxx>; juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx;
> sudeep.holla@xxxxxxx; aubrey.li@xxxxxxxxxxxxxxx;
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>;
> Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; guodong.xu@xxxxxxxxxx; yangyicong
> <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth) <liguozhu@xxxxxxxxxxxxx>;
> linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx
> Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> within one LLC
>
> On 29/04/2021 00:41, Song Bao Hua (Barry Song) wrote:
>
> [...]
>
> > Though we will never go to the slow path, wake_wide() will affect
> > want_affine, so it eventually affects "new_cpu"?
>
> yes.
>
> > 	for_each_domain(cpu, tmp) {
> > 		/*
> > 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
> > 		 * cpu is a valid SD_WAKE_AFFINE target.
> > 		 */
> > 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > 			if (cpu != prev_cpu)
> > 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> >
> > 			sd = NULL; /* Prefer wake_affine over balance flags */
> > 			break;
> > 		}
> >
> > 		if (tmp->flags & sd_flag)
> > 			sd = tmp;
> > 		else if (!want_affine)
> > 			break;
> > 	}
> >
> > If want_affine is false, the above won't execute, and new_cpu (the
> > target) will always be "prev_cpu"? So when task size > cluster size in
> > wake_wide(), we won't pull the wakee to the waker's cluster? That seems
> > sensible.
>
> What is `task size` here?
>
> The criterion is `!(slave < factor || master < slave * factor)` or
> `slave >= factor && master >= slave * factor` to wake wide.

Yes. By "task size" I actually meant the number of tasks in a waker-wakee
bundle, which can make `slave >= factor && master >= slave * factor`
either true or false and thereby change the target CPU we start scanning
from. Now, since this series moves the scan to the cluster level when
waker and wakee are already in the same LLC, it seems more sensible to
use the cluster size as the factor?

> I see that since you effectively change the sched domain size from LLC
> to CLUSTER (e.g. 24->6) for wakeups with cpu and prev_cpu sharing LLC
> (hence the `numactl -N 0` in your workload), wake_wide() has to take
> CLUSTER size into consideration.
>
> I was wondering if you saw wake_wide() returning 1 with your use cases:
>
> numactl -N 0 /usr/lib/lmbench/bin/stream -P [6,12] -M 1024M -N 5

I couldn't make wake_wide() return 1 with the above stream command, and
I can't reproduce it with a 1:1 (monogamous) hackbench either, i.e. "-f 1".
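For reference, the heuristic under discussion is mainline's wake_wide()
(roughly as in kernel/sched/fair.c around v5.12). The comment marks where
a cluster-based factor would plug in; "sd_cluster_size" is only an
illustrative name, not necessarily a symbol this series defines:

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	/*
	 * Mainline uses the LLC span size as the factor. The variant
	 * discussed above would instead read the cluster span size,
	 * e.g. something like:
	 *	factor = __this_cpu_read(sd_cluster_size);
	 * ("sd_cluster_size" is an illustrative name only.)
	 */
	int factor = __this_cpu_read(sd_llc_size);

	if (master < slave)
		swap(master, slave);
	/* Wake wide only when both flip counts clear the factor. */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

Shrinking the factor from llc_size to the cluster size makes the
`slave < factor` escape harder to hit, so moderate flip counts (like the
ones below) start returning 1 where the LLC factor would return 0.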
But I am able to reproduce this issue with an M:N hackbench, for example:

numactl -N 0 hackbench -p -T -f 10 -l 20000 -g 1

hackbench will create 10 senders and 10 receivers, and each sender can
send messages to all 10 receivers. I've often seen flips like:

waker	wakee
1501	39
1509	17
  11	1320
  13	2016

11, 13 and 17 are smaller than the LLC size but larger than the cluster
size, so wake_wide() using the cluster factor will return 1; on the other
hand, if we always use llc_size as the factor, it returns 0.

However, the change in wake_wide() seems to have a negative influence on
the M:N relationship (-f 10), according to tests made today with:

numactl -N 0 hackbench -p -T -f 10 -l 20000 -g $1

(time in seconds; lower is better)

g =		1	2	3	4
cluster_size	0.5768	0.6578	0.8117	1.0119
LLC_size	0.5479	0.6162	0.6922	0.7754

Always using llc_size as the factor in wake_wide() still shows better
results in the 10:10 polygamous hackbench. So it seems that
`slave >= factor && master >= slave * factor` isn't a suitable criterion
with the cluster size as the factor?

Thanks
Barry