* Ingo Molnar <mingo@xxxxxxx> wrote:

> Which gave these overall stats:
>
>  Performance counter stats for './prctl 0 0':
>
>    28414.696319  task-clock-msecs         #      0.997 CPUs
>               3  context-switches         #      0.000 M/sec
>               1  CPU-migrations           #      0.000 M/sec
>             149  page-faults              #      0.000 M/sec
>     87254432334  cycles                   #   3070.750 M/sec
>      5078691161  instructions             #      0.058 IPC
>          304144  cache-references         #      0.011 M/sec
>           28760  cache-misses             #      0.001 M/sec
>
>    28.501962853  seconds time elapsed.
>
> 87254432334/1000000000 ~== 87, so we have 87 cycles cost per
> iteration.

I also measured the GUP based copy_from_user_nmi(), on 64-bit (so
there's not even any real atomic-kmap/invlpg overhead):

 Performance counter stats for './prctl 0 0':

   55580.513882  task-clock-msecs         #      0.997 CPUs
              3  context-switches         #      0.000 M/sec
              1  CPU-migrations           #      0.000 M/sec
            149  page-faults              #      0.000 M/sec
   176375680192  cycles                   #   3173.337 M/sec
   299353138289  instructions             #      1.697 IPC
        3388060  cache-references         #      0.061 M/sec
        1318977  cache-misses             #      0.024 M/sec

   55.748468367  seconds time elapsed.

This shows the overhead of looking up pagetables: 176 cycles per
iteration. A cr2 save/restore pair is twice as fast.

Here's the profile btw:

 aldebaran:~> perf report -s s
 #
 # (1813480 samples)
 #
 # Overhead  Symbol
 # ........  ......
 #
    23.99%  [k] __get_user_pages_fast
    19.89%  [k] gup_pte_range
    18.98%  [k] gup_pud_range
    16.95%  [k] copy_from_user_nmi
    16.04%  [k] put_page
     3.17%  [k] sys_prctl
     0.02%  [k] _spin_lock
     0.02%  [k] copy_user_generic_string
     0.02%  [k] get_page_from_freelist

Taking a look at 'perf annotate __get_user_pages_fast' suggests these
two hot-spots:

    0.04 : ffffffff810310cc: 9c                    pushfq
    9.24 : ffffffff810310cd: 41 5d                 pop    %r13
    1.43 : ffffffff810310cf: fa                    cli
    3.44 : ffffffff810310d0: 48 89 fb              mov    %rdi,%rbx
    0.00 : ffffffff810310d3: 4d 8d 7e ff           lea    -0x1(%r14),%r15
    0.00 : ffffffff810310d7: 48 c1 eb 24           shr    $0x24,%rbx
    0.00 : ffffffff810310db: 81 e3 f8 0f 00 00     and    $0xff8,%ebx

15% of its overhead is here, 50% is here:

    0.71 : ffffffff81031141: 41 55                 push   %r13
    0.05 : ffffffff81031143: 9d                    popfq
   30.07 : ffffffff81031144: 8b 55 d4              mov    -0x2c(%rbp),%edx
    2.78 : ffffffff81031147: 48 83 c4 20           add    $0x20,%rsp
    0.00 : ffffffff8103114b: 89 d0                 mov    %edx,%eax
   10.93 : ffffffff8103114d: 5b                    pop    %rbx
    0.02 : ffffffff8103114e: 41 5c                 pop    %r12
    1.28 : ffffffff81031150: 41 5d                 pop    %r13
    0.51 : ffffffff81031152: 41 5e                 pop    %r14

So either pushfq+cli...popfq sequences are a lot more expensive on
Nehalem than I imagined, or instruction skidding is tricking us here.

gup_pte_range has a clear hotspot with a locked instruction:

    2.46 : ffffffff81030d88: 48 8d 41 08           lea    0x8(%rcx),%rax
    0.00 : ffffffff81030d8c: f0 ff 41 08           lock incl 0x8(%rcx)
   53.52 : ffffffff81030d90: 49 63 01              movslq (%r9),%rax
    0.00 : ffffffff81030d93: 48 81 c6 00 10 00 00  add    $0x1000,%rsi

That's 11% of the total overhead - or about 19 cycles per iteration.

So it seems cr2+direct-access is distinctly faster than fast-gup. And
the fast-gup overhead is paid _per call-chain frame_, while the cr2
save/restore is paid once _per NMI_ - which makes cr2+direct-access
_far_ more performant, as a dozen or more call-chain entries per NMI
are the norm.

	Ingo
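
PS: for reference, here is a minimal sketch of the cr2+direct-access
scheme these numbers favor - my illustration only, not the actual
patch. The function name is made up, and it assumes read_cr2(),
write_cr2(), pagefault_disable() and __copy_from_user_inatomic() are
available, as on x86-64 kernels of this vintage. The point is that a
fault taken by the probing copy would clobber %cr2 behind the
interrupted page-fault path's back, so we save/restore it by hand:

    /*
     * Sketch: NMI-safe user-memory copy via exception-table fixups,
     * with %cr2 preserved across any fault our own access takes.
     */
    static unsigned long
    copy_from_user_nmi_sketch(void *dst, const void __user *src,
                              unsigned long n)
    {
            unsigned long cr2 = read_cr2();     /* our fault below would clobber it */
            unsigned long left;

            pagefault_disable();                /* fix faults up, don't handle them */
            left = __copy_from_user_inatomic(dst, src, n);
            pagefault_enable();

            write_cr2(cr2);                     /* hide our fault from the real #PF path */

            return n - left;                    /* bytes actually copied */
    }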
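
PPS: to map the hot-spots back to C (simplified, not the exact mm
code): the pushfq+pop+cli / push+popfq pairs are the
local_irq_save()/local_irq_restore() bracket in
__get_user_pages_fast(), and the 'lock incl 0x8(%rcx)' in
gup_pte_range() is get_page() doing an atomic_inc() on the page
refcount - one locked RMW per page, ~19 cycles each here.
walk_page_tables() below is a made-up stand-in for the
gup_pud_range()/gup_pmd_range()/gup_pte_range() chain:

    /* Simplified sketch of the fast-gup irq-disabled section: */
    static int gup_fast_sketch(unsigned long start, int nr_pages,
                               struct page **pages)
    {
            unsigned long flags;
            int nr;

            local_irq_save(flags);              /* pushfq; pop %r13; cli */
            nr = walk_page_tables(start, nr_pages, pages);
            local_irq_restore(flags);           /* push %r13; popfq */

            return nr;                          /* each found page paid a 'lock incl' */
    }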