> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, December 2, 2020 11:41 PM
> To: 'Vincent Guittot' <vincent.guittot@xxxxxxxxxx>
> Cc: Valentin Schneider <valentin.schneider@xxxxxxx>; Catalin Marinas
> <catalin.marinas@xxxxxxx>; Will Deacon <will@xxxxxxxxxx>; Rafael J. Wysocki
> <rjw@xxxxxxxxxxxxx>; Cc: Len Brown <lenb@xxxxxxxxxx>;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> Ingo Molnar <mingo@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Juri
> Lelli <juri.lelli@xxxxxxxxxx>; Dietmar Eggemann <dietmar.eggemann@xxxxxxx>;
> Steven Rostedt <rostedt@xxxxxxxxxxx>; Ben Segall <bsegall@xxxxxxxxxx>; Mel
> Gorman <mgorman@xxxxxxx>; Mark Rutland <mark.rutland@xxxxxxx>; LAK
> <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>; linux-kernel
> <linux-kernel@xxxxxxxxxxxxxxx>; ACPI Devel Maling List
> <linux-acpi@xxxxxxxxxxxxxxx>; Linuxarm <linuxarm@xxxxxxxxxx>; xuwei (O)
> <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>
> Subject: RE: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
>
> > -----Original Message-----
> > From: Vincent Guittot [mailto:vincent.guittot@xxxxxxxxxx]
> > Sent: Wednesday, December 2, 2020 11:17 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > Subject: Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
> >
> > On Wed, 2 Dec 2020 at 10:20, Song Bao Hua (Barry Song)
> > <song.bao.hua@xxxxxxxxxxxxx> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Vincent Guittot [mailto:vincent.guittot@xxxxxxxxxx]
> > > > Sent: Wednesday, December 2, 2020 9:27 PM
> > > > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > > > Subject: Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
> > > >
> > > > On Tue, 1 Dec 2020 at 04:04, Barry Song <song.bao.hua@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > > > > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > > > > has local L3 tag. On the other hand, each clusters will share some
> > > > > internal system bus. This means cache coherence overhead inside one cluster
> > > > > is much less than the overhead across clusters.
> > > > >
> > > > > [ASCII diagram: six clusters of four CPUs each (CPU0-CPU3 labelled in
> > > > > the first cluster); each cluster is connected to its own L3 tag, and all
> > > > > six L3 tags sit in front of one shared L3 data block.]
> > > > >
> > > > > This patch adds the sched_domain for clusters. On kunpeng 920, without
> > > > > this patch, domain0 of cpu0 would be MC for cpu0-cpu23 with
> > > > > min_interval=24, max_interval=48; with this patch, MC becomes domain1,
> > > > > a new domain0 "CL" including cpu0-cpu3 is added with min_interval=4 and
> > > > > max_interval=8.
> > > > > This will affect load balance. For example, without this patch, while cpu0
> > > > > becomes idle, it will pull a task from cpu1-cpu15. With this patch, cpu0
> > > > > will try to pull a task from cpu1-cpu3 first. This will have much less
> > > > > overhead of task migration.
> > > > >
> > > > > On the other hand, while doing WAKE_AFFINE, this patch will try to find
> > > > > a core in the target cluster before scanning the llc domain.
> > > > > This means it will proactively use a core which has better affinity with
> > > > > target core at first.
> > > >
> > > > Which is at the opposite of what we are usually trying to do in the
> > > > fast wakeup path: trying to minimize resource sharing by finding an
> > > > idle core with all smt idle as an example
> > >
> > > In wake_affine case, I guess we are actually want some kind of
> > > resource sharing such as LLC to get waker and wakee get closer
> >
> > In wake_affine, we don't want to move outside the LLC but then in the
> > LLC we tries to minimize resource sharing like looking for a core
> > fully idle for SMT
> >
> > > to each other. find_idlest_cpu() is really opposite.
> > >
> > > So the real question is that LLC is always the right choice of
> > > idle sibling?
> >
> > That's the eternal question: spread or gather
>
> Indeed.
>
> > >
> > > In this case, 6 clusters are in same LLC, but hardware has different
> > > behavior for inside single cluster and across multiple clusters.
> > >
> > > > >
> > > > > Not much benchmark has been done yet. but here is a rough hackbench
> > > > > result.
> > > > > we run the below command with different -g parameter to increase system load
> > > > > by changing g from 1 to 4, for each one of 1-4, we run the benchmark ten times
> > > > > and record the data to get the average time:
> > > > >
> > > > > First, we run hackbench in only one NUMA node(cpu0-cpu23):
> > > > > $ numactl -N 0 hackbench -p -T -l 100000 -g $1
> > > >
> > > > What is your ref tree ? v5.10-rcX or tip/sched/core ?
> > >
> > > Actually I was using 5.9 release. That must be weird.
> > > But the reason is that disk driver is getting hang
> > > in my hardware in 5.10-rcx.
> >
> > In fact there are several changes in v5.10 and tip/sched/core that
> > could help your topology
>
> Will figure out some way to try.
>
> > > > >
> > > > > g=1 (seen cpu utilization around 50% for each core)
> > > > > Running in threaded mode with 1 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 7.689 7.485 7.485 7.458 7.524 7.539 7.738 7.693 7.568 7.674=7.5853
> > > > > w/ : 7.516 7.941 7.374 7.963 7.881 7.910 7.420 7.556 7.695 7.441=7.6697
> > > > > performance improvement w/ patch: -1.01%
> > > > >
> > > > > g=2 (seen cpu utilization around 70% for each core)
> > > > > Running in threaded mode with 2 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 10.127 10.119 10.070 10.196 10.057 10.111 10.045 10.164 10.162 9.955=10.1006
> > > > > w/ : 9.694 9.654 9.612 9.649 9.686 9.734 9.607 9.842 9.690 9.710=9.6878
> > > > > performance improvement w/ patch: 4.08%
> > > > >
> > > > > g=3 (seen cpu utilization around 90% for each core)
> > > > > Running in threaded mode with 3 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 15.885 15.254 15.932 15.647 16.120 15.878 15.857 15.759 15.674 15.721=15.7727
> > > > > w/ : 14.974 14.657 13.969 14.985 14.728 15.665 15.191 14.995 14.946 14.895=14.9005
> > > > > performance improvement w/ patch: 5.53%
> > > > >
> > > > > g=4
> > > > > Running in threaded mode with 4 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 20.014 21.025 21.119 21.235 19.767 20.971 20.962 20.914 21.090 21.090=20.8187
> > > > > w/ : 20.331 20.608 20.338 20.445 20.456 20.146 20.693 20.797 21.381 20.452=20.5647
> > > > > performance improvement w/ patch: 1.22%
> > > > >
> > > > > After that, we run the same hackbench in both NUMA nodes(cpu0-cpu47):
> > > > > g=1
> > > > > w/o: 7.351 7.416 7.486 7.358 7.516 7.403 7.413 7.411 7.421 7.454=7.4229
> > > > > w/ : 7.609 7.596 7.647 7.571 7.687 7.571 7.520 7.513 7.530 7.681=7.5925
> > > > > performance improvement by patch: -2.2%
> > > > >
> > > > > g=2
> > > > > w/o: 9.046 9.190 9.053 8.950 9.101 8.930 9.143 8.928 8.905 9.034=9.028
> > > > > w/ : 8.247 8.057 8.258 8.310 8.083 8.201 8.044 8.158 8.382 8.173=8.1913
> > > > > performance improvement by patch: 9.3%
> > > > >
> > > > > g=3
> > > > > w/o: 11.664 11.767 11.277 11.619 12.557 12.760 11.664 12.165 12.235 11.849=11.9557
> > > > > w/ : 9.387 9.461 9.650 9.613 9.591 9.454 9.496 9.716 9.327 9.722=9.5417
> > > > > performance improvement by patch: 20.2%
> > > > >
> > > > > g=4
> > > > > w/o: 17.347 17.299 17.655 18.775 16.707 18.879 17.255 18.356 16.859 18.515=17.7647
> > > > > w/ : 10.416 10.496 10.601 10.318 10.459 10.617 10.510 10.642 10.467 10.401=10.4927
> > > > > performance improvement by patch: 40.9%
> > > > >
> > > > > g=5
> > > > > w/o: 27.805 26.633 24.138 28.086 24.405 27.922 30.043 28.458 31.073 25.819=27.4382
> > > > > w/ : 13.817 13.976 14.166 13.688 14.132 14.095 14.003 13.997 13.954 13.907=13.9735
> > > > > performance improvement by patch: 49.1%
> > > > >
> > > > > It seems the patch can bring a huge increase on hackbench especially when
> > > > > we bind hackbench to all of cpu0-cpu47, comparing to 5.53% while running
> > > > > on single NUMA node(cpu0-cpu23)
> > > >
> > > > Interesting that this patch mainly impacts the numa case
> > > >
> > > > >
> > > > > Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
> > > > > ---
> > > > >  arch/arm64/Kconfig       |  7 +++++++
> > > > >  arch/arm64/kernel/smp.c  | 17 +++++++++++++++++
> > > > >  include/linux/topology.h |  7 +++++++
> > > > >  kernel/sched/fair.c      | 35 +++++++++++++++++++++++++++++++++++
> > > > >  4 files changed, 66 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > > index 6d23283..3583c26 100644
> > > > > --- a/arch/arm64/Kconfig
> > > > > +++ b/arch/arm64/Kconfig
> > > > > @@ -938,6 +938,13 @@ config SCHED_MC
> > > > >  	  making when dealing with multi-core CPU chips at a cost of slightly
> > > > >  	  increased overhead in some places. If unsure say N here.
> > > > >
> > > > > +config SCHED_CLUSTER
> > > > > +	bool "Cluster scheduler support"
> > > > > +	help
> > > > > +	  Cluster scheduler support improves the CPU scheduler's decision
> > > > > +	  making when dealing with machines that have clusters(sharing internal
> > > > > +	  bus or sharing LLC cache tag). If unsure say N here.
> > > > > +
> > > > >  config SCHED_SMT
> > > > >  	bool "SMT scheduler support"
> > > > >  	help
> > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > > index 355ee9e..5c8f026 100644
> > > > > --- a/arch/arm64/kernel/smp.c
> > > > > +++ b/arch/arm64/kernel/smp.c
> > > > > @@ -32,6 +32,7 @@
> > > > >  #include <linux/irq_work.h>
> > > > >  #include <linux/kexec.h>
> > > > >  #include <linux/kvm_host.h>
> > > > > +#include <linux/sched/topology.h>
> > > > >
> > > > >  #include <asm/alternative.h>
> > > > >  #include <asm/atomic.h>
> > > > > @@ -726,6 +727,20 @@ void __init smp_init_cpus(void)
> > > > >  	}
> > > > >  }
> > > > >
> > > > > +static struct sched_domain_topology_level arm64_topology[] = {
> > > > > +#ifdef CONFIG_SCHED_SMT
> > > > > +	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > > +#endif
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +	{ cpu_clustergroup_mask, cpu_core_flags, SD_INIT_NAME(CL) },
> > > > > +#endif
> > > > > +#ifdef CONFIG_SCHED_MC
> > > > > +	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> > > > > +#endif
> > > > > +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > > +	{ NULL, },
> > > > > +};
> > > > > +
> > > > >  void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > >  {
> > > > >  	const struct cpu_operations *ops;
> > > > > @@ -735,6 +750,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > > > >
> > > > >  	init_cpu_topology();
> > > > >
> > > > > +	set_sched_topology(arm64_topology);
> > > > > +
> > > > >  	this_cpu = smp_processor_id();
> > > > >  	store_cpu_topology(this_cpu);
> > > > >  	numa_store_cpu_info(this_cpu);
> > > > > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > > > > index 5f66648..2c823c0 100644
> > > > > --- a/include/linux/topology.h
> > > > > +++ b/include/linux/topology.h
> > > > > @@ -211,6 +211,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
> > > > >  }
> > > > >  #endif
> > > > >
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +static inline const struct cpumask *cpu_cluster_mask(int cpu)
> > > > > +{
> > > > > +	return topology_cluster_cpumask(cpu);
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  static inline const struct cpumask *cpu_cpu_mask(int cpu)
> > > > >  {
> > > > >  	return cpumask_of_node(cpu_to_node(cpu));
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 1a68a05..ae8ec910 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -6106,6 +6106,37 @@ static inline int select_idle_smt(struct task_struct *p, int target)
> > > > >
> > > > >  #endif /* CONFIG_SCHED_SMT */
> > > > >
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +/*
> > > > > + * Scan the local CLUSTER mask for idle CPUs.
> > > > > + */
> > > > > +static int select_idle_cluster(struct task_struct *p, int target)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	/* right now, no hardware with both cluster and smt to run */
> > > > > +	if (sched_smt_active())
> > > >
> > > > don't use smt static key but a dedicated one if needed
> > >
> > > Sure.
> > >
> > > >
> > > > > +		return -1;
> > > > > +
> > > > > +	for_each_cpu_wrap(cpu, cpu_cluster_mask(target), target) {
> > > > > +		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > > +			continue;
> > > > > +		if (available_idle_cpu(cpu))
> > > > > +			return cpu;
> > > > > +	}
> > > > > +
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +#else /* CONFIG_SCHED_CLUSTER */
> > > > > +
> > > > > +static inline int select_idle_cluster(struct task_struct *p, int target)
> > > > > +{
> > > > > +	return -1;
> > > > > +}
> > > > > +
> > > > > +#endif /* CONFIG_SCHED_CLUSTER */
> > > > > +
> > > > >  /*
> > > > >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> > > > >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > > > > @@ -6270,6 +6301,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > > >  	if ((unsigned)i < nr_cpumask_bits)
> > > > >  		return i;
> > > > >
> > > > > +	i = select_idle_cluster(p, target);
> > > > > +	if ((unsigned)i < nr_cpumask_bits)
> > > > > +		return i;
> > > >
> > > > This is yet another loop in the fast wake up path.
> > > >
> > > > I'm curious to know which part of this patch really gives the perf improvement ?
> > > > -Is it the new sched domain level with a shorter interval that is then
> > > > used by Load balance to better spread task in the cluster and between
> > > > clusters ?
> > > > -Or this new loop in the wake up path which tries to keep threads in
> > > > the same cluster ? which is at the opposite of the rest of the
> > > > scheduler which tries to spread
> > >
> > > If I don't scan cluster first for wake_affine, I almost don't see large
> > > hackbench change by the new sche_domain.
> > > For example:
> > > g=4 in hackbench on cpu0-cpu47(two numa)
> > > w/o patch: 17.7647 (average time in 10 times of hackbench)
> > > w/ the full patch: 10.4927
> > > w/ patch but drop select_idle_cluster(): 15.0931
> >
> > And for the case with one numa node ?
>
> That would be very frustrating as it is getting worse:
>
> g=1
> Running in threaded mode with 1 groups using 40 file descriptors
> Each sender will pass 100000 messages of 100 bytes
> w/o: 7.689 7.485 7.485 7.458 7.524 7.539 7.738 7.693 7.568 7.674=7.5853
> w/ : 7.516 7.941 7.374 7.963 7.881 7.910 7.420 7.556 7.695 7.441=7.6697
> w/ but dropped select_idle_cluster:
>      7.816 7.589 7.319 7.556 7.443 7.459 7.636 7.427 7.425 7.395=7.5065
>
> g=2
> Running in threaded mode with 2 groups using 40 file descriptors
> Each sender will pass 100000 messages of 100 bytes
> w/o: 10.127 10.119 10.070 10.196 10.057 10.111 10.045 10.164 10.162 9.955=10.1006
> w/ : 9.694 9.654 9.612 9.649 9.686 9.734 9.607 9.842 9.690 9.710=9.6878
> w/ but dropped select_idle_cluster:
>      10.222 10.078 10.063 10.317 9.963 10.060 10.089 9.934 10.152 10.077=10.0955
>
> g=3
> Running in threaded mode with 3 groups using 40 file descriptors
> Each sender will pass 100000 messages of 100 bytes
> w/o: 15.885 15.254 15.932 15.647 16.120 15.878 15.857 15.759 15.674 15.721=15.7727
> w/ : 14.974 14.657 13.969 14.985 14.728 15.665 15.191 14.995 14.946 14.895=14.9005
> w/ but dropped select_idle_cluster(getting worse than w/o):
>      16.892 16.962 17.248 17.392 17.336 17.705 17.113 17.633 17.477 17.378=17.3136
>
> g=4
> Running in threaded mode with 4 groups using 40 file descriptors
> Each sender will pass 100000 messages of 100 bytes
> w/o: 20.014 21.025 21.119 21.235 19.767 20.971 20.962 20.914 21.090 21.090=20.8187
> w/ : 20.331 20.608 20.338 20.445 20.456 20.146 20.693 20.797 21.381 20.452=20.5647
> w/ but dropped select_idle_cluster(getting worse than w/o):
>      24.075 24.122 24.243 24.000 24.223 23.791 23.246 24.904 23.990 24.431=24.1025

Sorry, please ignore this. I had added some printk calls here while
testing on one NUMA node. I will send the updated data in another email.

Thanks
Barry
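
For anyone rechecking the numbers in this thread: each "=avg" figure is the
mean of the ten hackbench run times on that row, and the "performance
improvement" percentages appear to be the relative drop in average time
against the unpatched kernel. The helpers below are only a sketch of that
arithmetic; the names average() and improvement_pct() are made up for
illustration and are not part of the patch or of hackbench.

```c
/*
 * Hypothetical helpers (not from the patch): reproduce the averages and
 * improvement percentages quoted in the hackbench tables.
 */

/* Mean of n run times, as in the "=avg" suffix of each data row. */
static double average(const double *times, int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += times[i];
	return sum / n;
}

/*
 * Relative improvement in percent. Positive means the patched kernel
 * finished faster than the baseline; negative means it was slower.
 */
static double improvement_pct(double t_without, double t_with)
{
	return (t_without - t_with) / t_without * 100.0;
}
```

With the g=3 single-node averages (15.7727 and 14.9005) improvement_pct()
gives roughly 5.53, and with the g=4 two-node averages (17.7647 and 10.4927)
roughly 40.9, matching the figures reported in the original posting.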