* Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxx> wrote:

> * Ingo Molnar (mingo@xxxxxxx) wrote:
> >
> > * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > On Mon, 15 Jun 2009, Ingo Molnar wrote:
> > > >
> > > > See the numbers in the other mail: about 33 million pagefaults
> > > > happen in a typical kernel build - that's ~400K/sec - and that
> > > > is not a particularly pagefault-heavy workload.
> > >
> > > Did you do any function-level profiles?
> > >
> > > Last I looked at it, the real cost of page faults was all in the
> > > memory copies and page clearing, and while it would be nice to
> > > speed up the kernel entry and exit, the few tens of cycles we
> > > might be able to get from there really aren't all that important.
> >
> > Yeah.
> >
> > Here are the function-level profiles of a typical kernel build on
> > a Nehalem box:
> >
> >  $ perf report --sort symbol
> >
> >  #
> >  # (14317328 samples)
> >  #
> >  # Overhead  Symbol
> >  # ........  ......
> >  #
> >     44.05%  0x000000001a0b80
>
> It makes me wonder how the following scenario is accounted:
>
> - Execution of a newly forked/exec'd process instruction causes a
>   fault. (traps, faults and interrupts can take roughly 2000
>   cycles to execute)
>
> - PC sampling interrupt fires.
>
> Will it account the execution time as part of user-space or
> kernel-space execution?

"It depends."

With call-chain profiling ("perf record --call-graph" +
"perf report --sort parent") this will show up as:

 #
 # (19525018 samples)
 #
 # Overhead  Parent symbol
 # ........  ....................
 #
    88.60%  [other]
     4.96%  do_page_fault
     1.74%  sys_write
     1.18%  sys_openat
     0.75%  sys_exit_group
     0.74%  sys_execve
     0.43%  sys_read
     0.28%  sys_mmap
     0.23%  sys_clone
     0.21%  sys_close
     0.17%  sys_munmap
     0.15%  sys_poll
     0.13%  sys_lstat
     0.09%  sys_faccessat
     0.05%  sys_mprotect

This line:

     4.96%  do_page_fault

is the summed-up overhead of all things page faults.
If you sort by a specific user-space symbol, then _its_ own generated
page-faults will be displayed. Say, you profile 'git gc' done in Git's
repo at 10 KHz:

  perf record -g -f -F 10000 -- ./git gc

Raw outline of overhead categories:

 $ perf report --sort parent

 #
 # Overhead  Parent symbol
 # ........  ....................
 #
    96.97%  [other]
     1.32%  do_page_fault
     0.54%  sys_write
     0.21%  sys_exit_group
     0.15%  sys_open
     0.14%  sys_execve
     0.13%  sys_mmap
     0.11%  sys_poll
     0.10%  sys_clone

Note that do_page_fault has 1.32% total overhead there. But if you
only look at main's overhead:

 #
 # Overhead  Symbol
 # ........  ......
 #
    33.12%  [.] lookup_object
    11.17%  [.] __GI_strlen
     5.14%  [.] decode_tree_entry
     2.94%  [.] __GI_memcpy
     2.58%  [.] find_pack_entry_one
     2.30%  [.] lookup_blob
     1.61%  [.] tree_entry
     1.16%  [.] process_tree
      ....
     0.08%  [k] page_fault
     0.02%  [k] do_page_fault
     0.02%  [k] page_fault
     0.02%  [k] filemap_fault
     0.02%  [k] __do_fault
     0.01%  [k] handle_mm_fault

The page fault overhead is down at the bottom. Why? Because most
pagefaults are not raised by 'main', but by the dynamic loader, which
runs before it.

> Depending on how the sampling mechanism finds out if it is running
> in kernel mode or userspace mode, this might make the userspace PC
> appear as currently running even though the current execution
> context is the very beginning of the page fault handler (1st
> instruction servicing the fault).

It's much more nuanced than a binary 'user-space' versus
'kernel-space' decision. A true 'raw' call-chain looks like this:

0x25b0 [0x108]: PERF_EVENT (IP, 5): 3455: 0xffffffff810b63ad period: 310083
... chain: nr:28
.....  0: ffffffffffffff80
.....  1: ffffffff810b63ad
.....  2: ffffffff81018258
.....  3: ffffffff810aeddb
.....  4: ffffffff810af14d
.....  5: ffffffff81019042
.....  6: ffffffff8153245e
.....  7: ffffffff81533783
.....  8: ffffffff815337cd
.....  9: ffffffff8105fc8c
..... 10: ffffffff81531c2a
..... 11: ffffffff81531e0e
..... 12: ffffffff8153174a
..... 13: ffffffff810b68aa
..... 14: ffffffff810daa24
..... 15: ffffffff810c558e
..... 16: ffffffff810c78e9
..... 17: ffffffff81533739
..... 18: ffffffff815314ff
..... 19: ffffffff810b1716
..... 20: ffffffff810b22a2
..... 21: ffffffff810e5586
..... 22: ffffffff810e6080
..... 23: ffffffff810e61a8
..... 24: ffffffff8100bd9b
..... 25: fffffffffffffe00
..... 26: 0000003641ed6590
..... 27: 0000003646e046b3
 ... thread: git:3455
 ...... dso: [kernel]

25 kernel-context entries, followed by a context separator
(fffffffffffffe00), followed by two user-space RIPs.

So whether this is a kernel-space or a user-space sample depends on
the analysis stage - on how you decide to look at it via perf report.
If you only look at the top surface via 'perf report --sort symbol',
it's a "kernel-space" sample. If you look deeper, it could be a
user-space one too.

The full list of contexts is:

enum perf_callchain_context {
	PERF_CONTEXT_HV			= (__u64)-32,
	PERF_CONTEXT_KERNEL		= (__u64)-128,
	PERF_CONTEXT_USER		= (__u64)-512,

	PERF_CONTEXT_GUEST		= (__u64)-2048,
	PERF_CONTEXT_GUEST_KERNEL	= (__u64)-2176,
	PERF_CONTEXT_GUEST_USER		= (__u64)-2560,

	PERF_CONTEXT_MAX		= (__u64)-4095,
};

and a call-chain can in theory include all of these, in a nice stack
of call-chain entries.

( Btw., we are planning to add context separators for IRQ and softirq
  contexts as well - to be able to isolate hardirq and softirq
  workloads, separated from the mostly unrelated syscall-level and
  user-level execution overhead. )

Hope this answers your questions,

	Ingo