On Wed, Mar 16, 2022 at 03:45:46PM -0700, Namhyung Kim wrote:
> Hello,
>
> There have been some requests for low-overhead kernel lock contention
> monitoring. The kernel has CONFIG_LOCK_STAT to provide such an
> infrastructure, either via /proc/lock_stat or via tracepoints directly.
>
> However it's not lightweight and is hard to use in production. So I'm
> trying to add new tracepoints for lock contention and to use them as a
> base for building a new monitoring system.

Hello Namhyung, I like this series very much. Lock contention became
much more visible without serious overhead.

For the series:
Tested-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>

How do you use these tracepoints? Is there a script you use?

For testing, I just wrote a simple bpftrace script:

$ sudo bpftrace -e '
	BEGIN {
		printf("Collecting lockstats... Hit Ctrl-C to end.\n");
	}
	tracepoint:lock:contention_begin {
		@start_us[tid] = nsecs / 1000;
	}
	tracepoint:lock:contention_end {
		if (args->ret == 0) {
			@stats[kstack] = hist(nsecs / 1000 - @start_us[tid]);
		}
	}
	END {
		clear(@start_us);
	}'

It shows the distribution of the slow-path wait time (in microseconds)
like below. Great.

@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    queued_spin_lock_slowpath+556
    _raw_spin_lock+108
    rmqueue_bulk+80
    rmqueue+1060
    get_page_from_freelist+372
    __alloc_pages+208
    alloc_pages_vma+160
    alloc_zeroed_user_highpage_movable+72
    do_anonymous_page+92
    handle_pte_fault+320
    __handle_mm_fault+252
    handle_mm_fault+244
    do_page_fault+340
    do_translation_fault+100
    do_mem_abort+76
    el0_da+60
    el0t_64_sync_handler+232
    el0t_64_sync+420
]:
[2, 4)                 1 |@                                                   |
[4, 8)                30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[8, 16)               25 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[16, 32)              33 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              29 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[64, 128)             13 |@@@@@@@@@@@@@@@@@@@@                                |
[128, 256)             2 |@@@                                                 |

@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    rwsem_down_write_slowpath+1216
    down_write_killable+100
    do_mprotect_pkey.constprop.0+176
    __arm64_sys_mprotect+40
    invoke_syscall+80
    el0_svc_common.constprop.0+76
    do_el0_svc+52
    el0_svc+48
    el0t_64_sync_handler+164
    el0t_64_sync+420
]:
[1]                   13 |@@@@@@@@@@@                                         |
[2, 4)                42 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[4, 8)                 5 |@@@@                                                |
[8, 16)               10 |@@@@@@@@                                            |
[16, 32)              60 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              41 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[64, 128)             40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[128, 256)            14 |@@@@@@@@@@@@                                        |
[256, 512)             7 |@@@@@@                                              |
[512, 1K)              6 |@@@@@                                               |
[1K, 2K)               2 |@                                                   |
[2K, 4K)               1 |                                                    |

@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    queued_spin_lock_slowpath+556
    _raw_spin_lock+108
    futex_wake+168
    do_futex+200
    __arm64_sys_futex+112
    invoke_syscall+80
    el0_svc_common.constprop.0+76
    do_el0_svc+52
    el0_svc+48
    el0t_64_sync_handler+164
    el0t_64_sync+420
]:
[0]                    3 |                                                    |
[1]                 2515 |@                                                   |
[2, 4)              8747 |@@@@@                                               |
[4, 8)             17052 |@@@@@@@@@@                                          |
[8, 16)            46706 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16, 32)           82105 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)           46918 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[64, 128)             99 |                                                    |
[128, 256)             1 |                                                    |
[256, 512)             8 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             5 |                                                    |
[32K, 64K)            12 |                                                    |
[64K, 128K)           34 |                                                    |
[128K, 256K)          68 |                                                    |
[256K, 512K)           7 |                                                    |

> * Changes in v3
>  - move the tracepoints deeper in the slow path
>  - remove the caller ip
>  - don't use task state in the flags
>
> * Changes in v2
>  - do not use lockdep infrastructure
>  - add flags argument to lock:contention_begin tracepoint
>
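A thought on the flags argument: since contention_begin carries a flags
value describing the lock type, the wait-time histograms above could also
be broken down by that value instead of by stack. Below is only a rough,
untested sketch along the lines of my script above; the args->flags and
args->ret field names are taken from this series, the map names are made
up, and no particular flag bit definitions are assumed (the raw value is
used as the map key):

$ sudo bpftrace -e '
	tracepoint:lock:contention_begin {
		// remember when this task started to wait and what kind of lock it waits on
		@begin_ns[tid] = nsecs;
		@lock_flags[tid] = args->flags;
	}
	tracepoint:lock:contention_end {
		// only count successful ends that have a matching begin on this task
		if (args->ret == 0 && @begin_ns[tid]) {
			// one wait-time histogram (us) per raw flags value
			@by_flags[@lock_flags[tid]] = hist((nsecs - @begin_ns[tid]) / 1000);
		}
		delete(@begin_ns[tid]);
		delete(@lock_flags[tid]);
	}
	END {
		clear(@begin_ns);
		clear(@lock_flags);
	}'

Decoding the raw value into spinlock / rwlock / mutex names would need
the flag definitions from the tracepoint header, so the sketch leaves it
numeric; combining kstack and the flags in one map key would give both
views at once.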
> I added a flags argument to contention_begin to classify the locks in
> question. It can tell whether it's a spinlock, reader-writer lock or
> a mutex. With a stacktrace, users can identify which lock is contended.
>
> Patch 01 adds the tracepoints and moves their definition to mutex.c so
> that they can be used without lockdep.
>
> Patch 02 actually installs the tracepoints in the locking code. To
> minimize the overhead, they were added separately in the slow path of
> the code. As spinlocks are defined in the arch headers, I couldn't
> handle them all. I've just added them to the generic queued spinlock
> and rwlock only. Each arch can add the tracepoints later.
>
> This series is based on the current tip/locking/core and you can get
> it from the 'locking/tracepoint-v3' branch in my tree at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
>
> Thanks,
> Namhyung
>
>
> Namhyung Kim (2):
>   locking: Add lock contention tracepoints
>   locking: Apply contention tracepoints in the slow path
>
>  include/trace/events/lock.h   | 54 +++++++++++++++++++++++++++++++++--
>  kernel/locking/lockdep.c      |  1 -
>  kernel/locking/mutex.c        |  6 ++++
>  kernel/locking/percpu-rwsem.c |  3 ++
>  kernel/locking/qrwlock.c      |  9 ++++++
>  kernel/locking/qspinlock.c    |  5 ++++
>  kernel/locking/rtmutex.c      | 11 +++++++
>  kernel/locking/rwbase_rt.c    |  3 ++
>  kernel/locking/rwsem.c        |  9 ++++++
>  kernel/locking/semaphore.c    | 14 ++++++++-
>  10 files changed, 110 insertions(+), 5 deletions(-)
>
>
> base-commit: cd27ccfc727e99352321c0c75012ab9c5a90321e
> --
> 2.35.1.894.gb6a874cedc-goog
>

--
Thank you, you are awesome!
Hyeonggon :-)