* Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Now, there is one advantage to inlining that I really like - and
> > that I personally have missed from the days when we used to
> > always do it: profiling. Inlining the spinlocks tends to do
> > _wonders_ for showing exactly which spinlock ends up being the
> > contention cases for certain loads. [...]
>
> We've got a new tool to determine that kind of information though:
> perf + call-graph profiling.
>
> This is how it works and how it can be used. Say we want to profile
> 'bw_pipe' from lmbench. The traditional way is to record a flat
> profile:
>
> aldebaran:~/l> perf record -f ./bw_pipe
> Pipe bandwidth: 4607.33 MB/sec
> [ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]
>
> and then to view the flat profile:
>
> aldebaran:~/l> perf report
> # Samples: 109331
> #
> # Overhead  Command  Shared Object             Symbol
> # ........  .......  ........................  ......
> #
>     39.61%  bw_pipe  [kernel]                  [k] copy_user_generic_string
>     19.32%  bw_pipe  ./bw_pipe                 [.] enough_duration
>      6.21%  bw_pipe  [kernel]                  [k] mutex_spin_on_owner
>      3.54%  bw_pipe  ./bw_pipe                 [.] two_op
>      3.54%  bw_pipe  ./bw_pipe                 [.] one_op
>      2.79%  bw_pipe  [kernel]                  [k] get_page_from_freelist
>      2.16%  bw_pipe  [kernel]                  [k] ia32_sysenter_target
>      1.91%  bw_pipe  [kernel]                  [k] pipe_read
>      1.83%  bw_pipe  [kernel]                  [k] pipe_write
>      1.63%  bw_pipe  [kernel]                  [k] free_hot_cold_page
>      1.21%  bw_pipe  [kernel]                  [k] put_page
>      1.19%  bw_pipe  [kernel]                  [k] __alloc_pages_nodemask
>      0.99%  bw_pipe  [kernel]                  [k] anon_pipe_buf_release
>      0.74%  bw_pipe  [kernel]                  [k] get_pageblock_flags_group
>
> Annoyingly enough, mutex_spin_on_owner() is in the profile but we
> have no idea what the parent function(s) are. (Plus there's no way
> we could reasonably inline mutex functions.) So we are stuck with
> guessing which of the many other functions above (or some other
> function which does not even get listed!) has that mutex spinning
> overhead.
>
> But if you do the recording via call-graph traces:
>
> aldebaran:~/l> perf record -g -f ./bw_pipe
> Pipe bandwidth: 4432.75 MB/sec
> [ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]
>
> (note the -g/--graph option). And then, when we view the profile via
> perf report, we get:

Sorry - the mail got trimmed due to a cut&paste mishap. Here's the
missing bw_pipe profile:

# Samples: 81007
#
# Overhead  Command  Shared Object             Symbol
# ........  .......  ........................  ......
#
     7.60%  bw_pipe  [kernel]                  [k] mutex_spin_on_owner
                |
                |--99.97%-- __mutex_lock_common
                |          __mutex_lock_slowpath
                |          mutex_lock
                |          |
                |          |--98.46%-- pipe_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f4f430
                |          |          0x600000007
                |          |
                |           --1.54%-- pipe_lock
                |                     pipe_wait
                |                     pipe_write
                |                     do_sync_write
                |                     vfs_write
                |                     sys_write
                |                     sysenter_dispatch
                |                     0xf7f4f430
                |                     0x600000007
                 --0.03%-- [...]

It shows that the mutex overhead is distributed between these two
functions:

                |          |--98.46%-- pipe_write
                |           --1.54%-- pipe_lock
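For anyone who wants to reproduce this kind of call graph without
installing lmbench, a minimal pipe-bandwidth workload along the lines
below will do. It is only a rough sketch (made-up chunk/transfer sizes
and file name, not the bw_pipe source), but recording it with perf
should give call chains of the same shape:

/*
 * pipe-bw.c: minimal pipe-bandwidth workload, a rough stand-in for
 * lmbench's bw_pipe (illustration only, not the lmbench code).
 *
 * The parent pushes TOTAL bytes through a pipe in CHUNK-sized writes
 * while the child drains it; the kernel-side pipe mutex and copy_user
 * paths exercised here are the ones visible in the profiles above.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/time.h>

#define CHUNK	(64 * 1024)			/* bytes per write() */
#define TOTAL	(512UL * 1024 * 1024)		/* bytes pushed through the pipe */

int main(void)
{
	static char buf[CHUNK];
	struct timeval t0, t1;
	int fd[2];
	pid_t pid;

	if (pipe(fd) < 0 || (pid = fork()) < 0) {
		perror("pipe/fork");
		return 1;
	}

	if (pid == 0) {
		/* child: drain the pipe until the writer closes it */
		close(fd[1]);
		while (read(fd[0], buf, sizeof(buf)) > 0)
			;
		_exit(0);
	}

	/* parent: time the writes and report the bandwidth */
	close(fd[0]);
	memset(buf, 0xaa, sizeof(buf));
	gettimeofday(&t0, NULL);

	for (unsigned long done = 0; done < TOTAL; done += CHUNK) {
		if (write(fd[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			break;
		}
	}
	close(fd[1]);
	wait(NULL);
	gettimeofday(&t1, NULL);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("Pipe bandwidth: %.2f MB/sec\n", TOTAL / 1048576.0 / secs);
	return 0;
}

Build it with "gcc -O2 -o pipe-bw pipe-bw.c", then record with
"perf record -g ./pipe-bw" and browse the result with "perf report".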
I also wanted to show a more complex example, how spin-lock overhead
gets distributed in the networking stack, in a "lat_tcp localhost"
workload:
>
> aldebaran:~/l> perf report -S _spin_lock
>
> # symbol: _spin_lock
> # Samples: 724
> #
> # Overhead  Command  Shared Object
> # ........  .......  ........................
> #
>     98.34%  lat_tcp  [kernel]
>                |
>                |--31.74%-- ipt_do_table
>                |          |
>                |          |--26.99%-- ipt_local_in_hook
>                |          |          nf_iterate
>                |          |          nf_hook_slow
>                |          |          ip_local_deliver
>                |          |          ip_rcv_finish
>                |          |          ip_rcv
>                |          |          netif_receive_skb
>                |          |          process_backlog
>                |          |          net_rx_action
>                |          |          __do_softirq
>                |          |          call_softirq
>                |          |          do_softirq
>                |          |          _local_bh_enable_ip
>                |          |          local_bh_enable
>                |          |          dev_queue_xmit
>                |          |          ip_finish_output2
>                |          |          ip_finish_output
>                |          |          ip_output
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |          |--24.34%-- ipt_local_in_hook
>                |          |          nf_iterate
>                |          |          nf_hook_slow
>                |          |          ip_local_deliver
>                |          |          ip_rcv_finish
>                |          |          ip_rcv
>                |          |          netif_receive_skb
>                |          |          process_backlog
>                |          |          net_rx_action
>                |          |          __do_softirq
>                |          |          call_softirq
>                |          |          do_softirq
>                |          |          _local_bh_enable_ip
>                |          |          local_bh_enable
>                |          |          dev_queue_xmit
>                |          |          ip_finish_output2
>                |          |          ip_finish_output
>                |          |          ip_output
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |          |--16.81%-- ipt_pre_routing_hook
>                |          |          nf_iterate
>                |          |          nf_hook_slow
>                |          |          ip_rcv
>                |          |          netif_receive_skb
>                |          |          process_backlog
>                |          |          net_rx_action
>                |          |          __do_softirq
>                |          |          call_softirq
>                |          |          do_softirq
>                |          |          _local_bh_enable_ip
>                |          |          local_bh_enable
>                |          |          dev_queue_xmit
>                |          |          ip_finish_output2
>                |          |          ip_finish_output
>                |          |          ip_output
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |          |--11.95%-- ipt_post_routing_hook
>                |          |          nf_iterate
>                |          |          nf_hook_slow
>                |          |          ip_output
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |          |--10.18%-- ipt_local_out_hook
>                |          |          nf_iterate
>                |          |          nf_hook_slow
>                |          |          __ip_local_out
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |           --9.73%-- ipt_local_hook
>                |                     nf_iterate
>                |                     nf_hook_slow
>                |                     __ip_local_out
>                |                     ip_local_out
>                |                     ip_queue_xmit
>                |                     tcp_transmit_skb
>                |                     tcp_write_xmit
>                |                     __tcp_push_pending_frames
>                |                     tcp_push
>                |                     tcp_sendmsg
>                |                     __sock_sendmsg
>                |                     sock_aio_write
>                |                     do_sync_write
>                |                     vfs_write
>                |                     sys_write
>                |                     sysenter_dispatch
>                |                     0xf7f50430
>                |                     0x5
>                |
>                |--31.46%-- __mod_timer
>                |          mod_timer
>                |          sk_reset_timer
>                |          |
>                |          |--55.36%-- tcp_send_delayed_ack
>                |          |          __tcp_ack_snd_check
>                |          |          tcp_rcv_established
>                |          |          tcp_v4_do_rcv
>                |          |          tcp_prequeue_process
>                |          |          tcp_recvmsg
>                |          |          sock_common_recvmsg
>                |          |          __sock_recvmsg
>                |          |          sock_aio_read
>                |          |          do_sync_read
>                |          |          vfs_read
>                |          |          sys_read
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |           --44.64%-- tcp_v4_rcv
>                |                     ip_local_deliver_finish
>                |                     ip_local_deliver
>                |                     ip_rcv_finish
>                |                     ip_rcv
>                |                     netif_receive_skb
>                |                     process_backlog
>                |                     net_rx_action
>                |                     __do_softirq
>                |                     call_softirq
>                |                     do_softirq
>                |                     _local_bh_enable_ip
>                |                     local_bh_enable
>                |                     dev_queue_xmit
>                |                     ip_finish_output2
>                |                     ip_finish_output
>                |                     ip_output
>                |                     ip_local_out
>                |                     ip_queue_xmit
>                |                     tcp_transmit_skb
>                |                     tcp_write_xmit
>                |                     __tcp_push_pending_frames
>                |                     tcp_push
>                |                     tcp_sendmsg
>                |                     __sock_sendmsg
>                |                     sock_aio_write
>                |                     do_sync_write
>                |                     vfs_write
>                |                     sys_write
>                |                     sysenter_dispatch
>                |                     0xf7f50430
>                |                     0x5
>                |
>                |--21.21%-- task_rq_lock
>                |          try_to_wake_up
>                |          default_wake_function
>                |          |
>                |          |--99.34%-- autoremove_wake_function
>                |          |          __wake_up_common
>                |          |          __wake_up
>                |          |          tcp_v4_rcv
>                |          |          ip_local_deliver_finish
>                |          |          ip_local_deliver
>                |          |          ip_rcv_finish
>                |          |          ip_rcv
>                |          |          netif_receive_skb
>                |          |          process_backlog
>                |          |          net_rx_action
>                |          |          __do_softirq
>                |          |          call_softirq
>                |          |          do_softirq
>                |          |          _local_bh_enable_ip
>                |          |          local_bh_enable
>                |          |          dev_queue_xmit
>                |          |          ip_finish_output2
>                |          |          ip_finish_output
>                |          |          ip_output
>                |          |          ip_local_out
>                |          |          ip_queue_xmit
>                |          |          tcp_transmit_skb
>                |          |          tcp_write_xmit
>                |          |          __tcp_push_pending_frames
>                |          |          tcp_push
>                |          |          tcp_sendmsg
>                |          |          __sock_sendmsg
>                |          |          sock_aio_write
>                |          |          do_sync_write
>                |          |          vfs_write
>                |          |          sys_write
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |           --0.66%-- pollwake
>                |                     __wake_up_common
>                |                     __wake_up
>                |                     perf_counter_wakeup
>                |                     perf_pending_counter
>                |                     perf_counter_do_pending
>                |                     smp_perf_pending_interrupt
>                |                     perf_pending_interrupt
>                |                     0xf7f50430
>                |                     0x5
>                |
>                |--5.06%-- tcp_v4_rcv
>                |          ip_local_deliver_finish
>                |          ip_local_deliver
>                |          ip_rcv_finish
>                |          ip_rcv
>                |          netif_receive_skb
>                |          process_backlog
>                |          net_rx_action
>                |          __do_softirq
>                |          call_softirq
>                |          do_softirq
>                |          _local_bh_enable_ip
>                |          local_bh_enable
>                |          dev_queue_xmit
>                |          ip_finish_output2
>                |          ip_finish_output
>                |          ip_output
>                |          ip_local_out
>                |          ip_queue_xmit
>                |          tcp_transmit_skb
>                |          tcp_write_xmit
>                |          __tcp_push_pending_frames
>                |          tcp_push
>                |          tcp_sendmsg
>                |          __sock_sendmsg
>                |          sock_aio_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          sysenter_dispatch
>                |          0xf7f50430
>                |          0x5
>                |
>                |--4.07%-- __perf_counter_sched_out
>                |          perf_counter_task_sched_out
>                |          schedule
>                |          schedule_timeout
>                |          sk_wait_data
>                |          tcp_recvmsg
>                |          sock_common_recvmsg
>                |          __sock_recvmsg
>                |          sock_aio_read
>                |          do_sync_read
>                |          vfs_read
>                |          sys_read
>                |          sysenter_dispatch
>                |          0xf7f50430
>                |          0x5
>                |
>                |--1.40%-- ipt_local_hook
>                |          nf_iterate
>                |          nf_hook_slow
>                |          __ip_local_out
>                |          ip_local_out
>                |          ip_queue_xmit
>                |          tcp_transmit_skb
>                |          tcp_write_xmit
>                |          __tcp_push_pending_frames
>                |          tcp_push
>                |          tcp_sendmsg
>                |          __sock_sendmsg
>                |          sock_aio_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          sysenter_dispatch
>                |          0xf7f50430
>                |          0x5
>                |
>                |--1.26%-- mod_timer
>                |          sk_reset_timer
>                |          |
>                |          |--66.67%-- tcp_send_delayed_ack
>                |          |          __tcp_ack_snd_check
>                |          |          tcp_rcv_established
>                |          |          tcp_v4_do_rcv
>                |          |          tcp_prequeue_process
>                |          |          tcp_recvmsg
>                |          |          sock_common_recvmsg
>                |          |          __sock_recvmsg
>                |          |          sock_aio_read
>                |          |          do_sync_read
>                |          |          vfs_read
>                |          |          sys_read
>                |          |          sysenter_dispatch
>                |          |          0xf7f50430
>                |          |          0x5
>                |          |
>                |           --33.33%-- tcp_v4_rcv
>                |                     ip_local_deliver_finish
>                |                     ip_local_deliver
>                |                     ip_rcv_finish
>                |                     ip_rcv
>                |                     netif_receive_skb
>                |                     process_backlog
>                |                     net_rx_action
>                |                     __do_softirq
>                |                     call_softirq
>                |                     do_softirq
>                |                     _local_bh_enable_ip
>                |                     local_bh_enable
>                |                     dev_queue_xmit
>                |                     ip_finish_output2
>                |                     ip_finish_output
>                |                     ip_output
>                |                     ip_local_out
>                |                     ip_queue_xmit
>                |                     tcp_transmit_skb
>                |                     tcp_write_xmit
>                |                     __tcp_push_pending_frames
>                |                     tcp_push
>                |                     tcp_sendmsg
>                |                     __sock_sendmsg
>                |                     sock_aio_write
>                |                     do_sync_write
>                |                     vfs_write
>                |                     sys_write
>                |                     sysenter_dispatch
>                |                     0xf7f50430
>                |                     0x5
>                |
>                |--0.70%-- ipt_local_in_hook
>                |          nf_iterate
>                |          nf_hook_slow
>                |          ip_local_deliver
>                |          ip_rcv_finish
>                |          ip_rcv
>                |          netif_receive_skb
>                |          process_backlog
>                |          net_rx_action
>                |          __do_softirq
>                |          call_softirq
>                |          do_softirq
>                |          _local_bh_enable_ip
>                |          local_bh_enable
>                |          dev_queue_xmit
>                |          ip_finish_output2
>                |          ip_finish_output
>                |          ip_output
>                |          ip_local_out
>                |          ip_queue_xmit
>                |          tcp_transmit_skb
>                |          tcp_write_xmit
>                |          __tcp_push_pending_frames
>                |          tcp_push
>                |          tcp_sendmsg
>                |          __sock_sendmsg
>                |          sock_aio_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          sysenter_dispatch
>                |          0xf7f50430
>                |          0x5
>                |
>                |--0.56%-- perf_counter_task_sched_out
>                |          schedule
>                |          schedule_timeout
>                |          sk_wait_data
>                |          tcp_recvmsg
>                |          sock_common_recvmsg
>                |          __sock_recvmsg
>                |          sock_aio_read
>                |          do_sync_read
>                |          vfs_read
>                |          sys_read
>                |          sysenter_dispatch
>                |          0xf7f50430
>                |          0x5
>                 --2.53%-- [...]
>
>      1.66%  perf     [kernel]
>                |
>                |--25.00%-- journal_stop
>                |          __ext3_journal_stop
>                |          ext3_writeback_write_end
>                |          generic_perform_write
>                |          generic_file_buffered_write
>                |          __generic_file_aio_write_nolock
>                |          generic_file_aio_write
>                |          ext3_file_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          system_call_fastpath
>                |          __write_nocancel
>                |
>                |--16.67%-- __perf_counter_sched_out
>                |          perf_counter_task_sched_out
>                |          schedule
>                |          schedule_hrtimeout_range
>                |          poll_schedule_timeout
>                |          do_poll
>                |          do_sys_poll
>                |          sys_poll
>                |          system_call_fastpath
>                |          __GI___poll
>                |
>                |--8.33%-- dup_mmap
>                |          dup_mm
>                |          copy_process
>                |          do_fork
>                |          sys_clone
>                |          stub_clone
>                |          fork
>                |          __cmd_record
>                |
>                |--8.33%-- hrtimer_interrupt
>                |          smp_apic_timer_interrupt
>                |          apic_timer_interrupt
>                |          __write_nocancel
>                |
>                |--8.33%-- journal_start
>                |          ext3_journal_start_sb
>                |          ext3_write_begin
>                |          generic_perform_write
>                |          generic_file_buffered_write
>                |          __generic_file_aio_write_nolock
>                |          generic_file_aio_write
>                |          ext3_file_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          system_call_fastpath
>                |          __write_nocancel
>                |
>                |--8.33%-- create_empty_buffers
>                |          __block_prepare_write
>                |          block_write_begin
>                |          ext3_write_begin
>                |          generic_perform_write
>                |          generic_file_buffered_write
>                |          __generic_file_aio_write_nolock
>                |          generic_file_aio_write
>                |          ext3_file_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          system_call_fastpath
>                |          __write_nocancel
>                |
>                |--8.33%-- perf_counter_task_sched_out
>                |          schedule
>                |          schedule_hrtimeout_range
>                |          poll_schedule_timeout
>                |          do_poll
>                |          do_sys_poll
>                |          sys_poll
>                |          system_call_fastpath
>                |          __GI___poll
>                |
>                |--8.33%-- ext3_alloc_branch
>                |          ext3_get_blocks_handle
>                |          ext3_get_block
>                |          __block_prepare_write
>                |          block_write_begin
>                |          ext3_write_begin
>                |          generic_perform_write
>                |          generic_file_buffered_write
>                |          __generic_file_aio_write_nolock
>                |          generic_file_aio_write
>                |          ext3_file_write
>                |          do_sync_write
>                |          vfs_write
>                |          sys_write
>                |          system_call_fastpath
>                |          __write_nocancel
>                |
>                 --8.33%-- start_this_handle
>                           journal_start
>                           ext3_journal_start_sb
>                           ext3_write_begin
>                           generic_perform_write
>                           generic_file_buffered_write
>                           __generic_file_aio_write_nolock
>                           generic_file_aio_write
>                           ext3_file_write
>                           do_sync_write
>                           vfs_write
>                           sys_write
>                           system_call_fastpath
>                           __write_nocancel
>
> The graph above (which is the sub-graph of the full graph, limited
> via the -S/--symbols option to all callchains that intersect with
> the _spin_lock symbol) details all the various code-paths that make
> use of _spin_lock(), and the weights (and relative weights) of
> those codepaths.
>
> Note that this works for just about any locking primitive, and note
> how it's recursive. If we inlined spin_lock we'd only know this
> __mod_timer overhead:
>
>                |--31.46%-- __mod_timer
>
> but we wouldn't know the second level (which is interesting too):
>
>                |          |--55.36%-- tcp_send_delayed_ack
>                |           --44.64%-- tcp_v4_rcv
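To put that attribution point in C terms, here is a toy sketch
(hypothetical my_spin_lock()/my_lock names, not the kernel's spinlock
code) of why the two build variants profile differently: with the
out-of-line function every spinning cycle is charged to the one
my_spin_lock symbol, so a flat profile names the lock but not its
callers, while the inline variant charges the same cycles to each
caller's symbol - information the call graph above recovers without
having to inline anything:

/*
 * Toy illustration only -- not the kernel's spinlock implementation.
 * A struct my_lock must have its flag initialised with
 * ATOMIC_FLAG_INIT before first use.
 */
#include <stdatomic.h>

struct my_lock {
	atomic_flag locked;		/* set = held */
};

/* Out-of-line: all spinning samples pile up under this one symbol. */
void my_spin_lock(struct my_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;			/* spin */
}

/* Inline: the same spinning cycles land in each caller's symbol. */
static inline void my_spin_lock_inline(struct my_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;			/* spin */
}

void my_spin_unlock(struct my_lock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/*
 * A hypothetical caller: with my_spin_lock_inline() a flat profile
 * charges any contention to timer_tick() itself; with my_spin_lock()
 * the samples show up under my_spin_lock() and the caller is only
 * visible via -g call graphs, as in the report above.
 */
void timer_tick(struct my_lock *base_lock)
{
	my_spin_lock(base_lock);
	/* ... critical section ... */
	my_spin_unlock(base_lock);
}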
>
> All in all, I think we should detach the issue of instrumentation
> and profiling from the issue of whether to inline or not. We should
> inline where it makes things faster. Our profiling tools are able
> to deal with it, independently of whether we inline or not.
>
> 	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html