Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 5 Mar 2015 10:00:45 +1100

On Wed, Mar 04, 2015 at 08:00:46PM +0000, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > >>
> > > > > >> But are those migrate-page calls really common enough to make these
> > > > > >> things happen often enough on the same pages for this all to matter?
> > > > > >
> > > > > > It's looking like that's a possibility.
> > > > > 
> > > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > > re-introduced the "pte was already NUMA" case.
> > > > > 
> > > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > > numa balancing isn't working right.
> > > > 
> > > > So that should show up in the vmstats, right? Oh, and there's a
> > > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > > 
> > > 
> > > The stats indicate both more updates and more faults. Can you try this
> > > please? It's against 4.0-rc1.
> > > 
> > > ---8<---
> > > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> > 
> > Makes no noticable difference to behaviour or performance. Stats:
> > 
> 
> After going through the series again, I did not spot why there is a
> difference. It's functionally similar and I would hate the theory that
> this is somehow hardware related due to the use of bits it takes action
> on.

I doubt it's hardware related - I'm testing inside a VM, and the
host is a year old Dell r820 server, so it's a pretty common
hardware I'd think.

Guest:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : QEMU Virtual CPU version 2.0.0
stepping        : 3
microcode       : 0x1
cpu MHz         : 2199.998
cache size      : 4096 KB
physical id     : 15
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16 x2apic popcnt hypervisor lahf_lm
bugs            :
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Host:

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping        : 7
microcode       : 0x70d
cpu MHz         : 1190.750
cache size      : 16384 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4400.75
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

> There is nothing in the manual that indicates that it would. Try this
> as I don't want to leave this hanging before LSF/MM because it'll mask other
> reports. It alters the maximum rate automatic NUMA balancing scans ptes.
> 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..40ae5d84d4ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * calculated based on the tasks virtual memory size and
>   * numa_balancing_scan_size.
>   */
> -unsigned int sysctl_numa_balancing_scan_period_min = 1000;
> +unsigned int sysctl_numa_balancing_scan_period_min = 2000;
>  unsigned int sysctl_numa_balancing_scan_period_max = 60000;

Made absolutely no difference:

	357,635      migrate:mm_migrate_pages      ( +-  4.11% )

numa_hit 36724642
numa_miss 92477
numa_foreign 92477
numa_interleave 11835
numa_local 36709671
numa_other 107448
numa_pte_updates 83924860
numa_huge_pte_updates 0
numa_hint_faults 81856035
numa_hint_faults_local 22104529
numa_pages_migrated 32766735
pgmigrate_success 32766735
pgmigrate_fail 0

Runtime was actually a minute worse (18m35s vs 17m39s) than without
this patch.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>