* David Rientjes <rientjes@xxxxxxxxxx> wrote:

> On Tue, 20 Nov 2012, Ingo Molnar wrote:
> 
> > > This happened to be an Opteron (but not 83xx series), 2.4Ghz.
> > 
> > Ok - roughly which family/model from /proc/cpuinfo?
> 
> It's close enough, it's 23xx.

Ok - which family/model number in /proc/cpuinfo? I'm asking 
because that will matter most to the page fault 
micro-characteristics: the 23xx series existed in Barcelona form 
(family/model 16/2) and still exists in its current Shanghai 
form.

My guess is Barcelona 16/2? If that is correct then the closest 
I can get to your topology is a 4-socket, 16-way Opteron system 
with 32 GB of RAM - which seems close enough for testing 
purposes.

But checking numa/core on such a system still leaves me 
absolutely puzzled, as I get the following with a similar 
16-warehouses SPECjbb 2005 test, using "java -Xms8192m -Xmx8192m 
-Xss256k" sizing, THP enabled, 2 x 240-second runs (I tried to 
configure it all very close to yours), using -tip-a07005cbd847:

  kernel       warehouses    transactions/sec
  ----------   ----------    ----------------
  v3.7-rc6:        16             197802
                   16             197997
  numa/core:       16             203086
                   16             203967

So sadly numa/core is about 2%-3% faster on this 4x4 system too! 
:-/

But I have to say, your SPECjbb score is uncharacteristically 
low even for an oddball-topology Barcelona system - which is the 
oldest/slowest system I can think of. So there might be more to 
this.

To further characterise a "good" SPECjbb run, there's no 
page_fault overhead visible in perf top:

Mainline profile:

  94.99%  perf-1244.map    [.] 0x00007f04cd1aa523
   2.52%  libjvm.so        [.] 0x00000000007004a1
   0.62%  [vdso]           [.] 0x0000000000000972
   0.31%  [kernel]         [k] clear_page_c
   0.17%  [kernel]         [k] timekeeping_get_ns.constprop.7
   0.11%  [kernel]         [k] rep_nop
   0.09%  [kernel]         [k] ktime_get
   0.08%  [kernel]         [k] get_cycles
   0.06%  [kernel]         [k] read_tsc
   0.05%  libc-2.15.so     [.] __strcmp_sse2

numa/core profile:

  95.66%  perf-1201.map    [.] 0x00007fe4ad1c8fc7
   1.70%  libjvm.so        [.] 0x0000000000381581
   0.59%  [vdso]           [.] 0x0000000000000607
   0.19%  [kernel]         [k] do_raw_spin_lock
   0.11%  [kernel]         [k] generic_smp_call_function_interrupt
   0.11%  [kernel]         [k] timekeeping_get_ns.constprop.7
   0.08%  [kernel]         [k] ktime_get
   0.06%  [kernel]         [k] get_cycles
   0.05%  [kernel]         [k] __native_flush_tlb
   0.05%  [kernel]         [k] rep_nop
   0.04%  perf             [.] add_hist_entry.isra.9
   0.04%  [kernel]         [k] rcu_check_callbacks
   0.04%  [kernel]         [k] ktime_get_update_offsets
   0.04%  libc-2.15.so     [.] __strcmp_sse2

No page fault overhead (see the page fault rate further below) - 
the NUMA scanning overhead shows up only through some mild TLB 
flush activity (which I'll fix btw).

[ Stupid question: cpufreq is configured to always-2.4GHz, 
  right? If you could send me your kernel config (you can do 
  that privately as well) then I can try to boot it and see. ]

> > > It's perf top -U, the benchmark itself was unchanged so I 
> > > didn't think it was interesting to gather the user 
> > > symbols. If that would be helpful, let me know!
> > 
> > Yeah, regular perf top output would be very helpful to get a 
> > general sense of proportion. Thanks!
> 
> Ok, here it is:
> 
>  91.24%  perf-10971.map   [.] 0x00007f116a6c6fb8
>   1.19%  libjvm.so        [.] instanceKlass::oop_push_contents(PSPromotionMa
>   1.04%  libjvm.so        [.] PSPromotionManager::drain_stacks_depth(bool)
>   0.79%  libjvm.so        [.] PSPromotionManager::copy_to_survivor_space(oop
>   0.60%  libjvm.so        [.] PSPromotionManager::claim_or_forward_internal_
>   0.58%  [kernel]         [k] page_fault
>   0.28%  libc-2.3.6.so    [.] __gettimeofday
>   0.26%  libjvm.so        [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
>   0.22%  [kernel]         [k] getnstimeofday
>   0.18%  libjvm.so        [.] CardTableExtension::scavenge_contents_parallel(ObjectS
>   0.15%  [kernel]         [k] _raw_spin_lock
>   0.12%  [kernel]         [k] ktime_get_update_offsets
>   0.11%  [kernel]         [k] ktime_get
>   0.11%  [kernel]         [k] rcu_check_callbacks
>   0.10%  [kernel]         [k] generic_smp_call_function_interrupt
>   0.10%  [kernel]         [k] read_tsc
>   0.10%  [kernel]         [k] clear_page_c
>   0.10%  [kernel]         [k] __do_page_fault
>   0.08%  [kernel]         [k] handle_mm_fault
>   0.08%  libjvm.so        [.] os::javaTimeMillis()
>   0.08%  [kernel]         [k] emulate_vsyscall

Oh, finally a clue: you seem to have vsyscall emulation 
overhead!

Vsyscall emulation is fundamentally page fault driven - which 
might explain why you are seeing page fault overhead. It might 
also interact with other sources of faults - such as numa/core's 
working set probing ...
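[ To make the mechanism concrete - a minimal sketch of mine, 
  assuming only the fixed x86-64 legacy vsyscall ABI: the page 
  sits at 0xffffffffff600000, with the vgettimeofday() slot at 
  offset 0x0. Under vsyscall=emulate the page is not executable, 
  so every call into it traps and gets handled by 
  emulate_vsyscall() via the page fault path - a loop like the 
  one below thus produces a steady page fault stream: ]

/*
 * vsyscall-test.c: call gettimeofday() through the legacy
 * x86-64 vsyscall page. Under vsyscall=emulate each call takes
 * a page fault handled by emulate_vsyscall(); under
 * vsyscall=native it executes directly.
 */
#include <stdio.h>
#include <sys/time.h>

#define VSYSCALL_ADDR	0xffffffffff600000UL	/* fixed ABI address */

typedef int (*vgtod_t)(struct timeval *tv, struct timezone *tz);

int main(void)
{
	vgtod_t vgtod = (vgtod_t)VSYSCALL_ADDR;	/* vgettimeofday slot */
	struct timeval tv;
	int i;

	for (i = 0; i < 1000000; i++)
		vgtod(&tv, NULL);

	printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);

	return 0;
}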
Many JVMs try to be smart with the vsyscall. As a test, does the 
vsyscall=native boot option change the results/behavior in any 
way?

Stupid question: if you apply the patch attached below and do 
page fault profiling while the run is in steady state:

   perf record -e faults -g -a sleep 10

do you see the faults often coming from the vsyscall page?

Also, this:

   perf stat -e faults -a --repeat 10 sleep 1

should normally report something like this during SPECjbb steady 
state, numa/core:

   warmup:        3,895 faults/sec  ( +- 12.11% )
   steady state:  3,910 faults/sec  ( +-  6.72% )

Which is about 250 faults/sec/CPU (3,910 faults/sec spread over 
16 CPUs is ~244 per CPU) - i.e. it should be barely recognizable 
in profiles, let alone be as prominent as it is in yours.

Thanks,

	Ingo

---
 arch/x86/mm/fault.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -1030,6 +1030,9 @@ __do_page_fault(struct pt_regs *regs, un
 	/* Get the faulting address: */
 	address = read_cr2();
 
+	/* Instrument as early as possible: */
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
 	/*
 	 * Detect and handle instructions that would cause a page fault for
 	 * both a tracked kernel page and a userspace page.
@@ -1107,8 +1110,6 @@ __do_page_fault(struct pt_regs *regs, un
 		}
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
 	 * If we're in an interrupt, have no user context or are running
 	 * in an atomic region then we must not take the fault:
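[ For completeness, a minimal userspace sketch - my own 
  illustration, not part of the patch - that reads the same 
  PERF_COUNT_SW_PAGE_FAULTS software counter the patch hooks, 
  via perf_event_open(), and triggers a known number of minor 
  faults to sanity-check it: ]

/*
 * pgfault-count.c: count this task's page faults via the
 * PERF_COUNT_SW_PAGE_FAULTS software event.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	unsigned long long count;
	char *buf;
	int fd, i;

	memset(&attr, 0, sizeof(attr));
	attr.type     = PERF_TYPE_SOFTWARE;
	attr.size     = sizeof(attr);
	attr.config   = PERF_COUNT_SW_PAGE_FAULTS;
	attr.disabled = 1;

	/* Count page faults of the current task, on any CPU: */
	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* Touch 100 fresh anonymous pages - one minor fault each: */
	buf = mmap(NULL, 100 * 4096, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	for (i = 0; i < 100; i++)
		buf[i * 4096] = 1;

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &count, sizeof(count)) != sizeof(count))
		return 1;

	printf("page faults: %llu\n", count);	/* expect ~100 */

	close(fd);
	return 0;
}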