The overall goal of this patch is to add tracepoints around mmap_lock
acquisition. This will let us collect latency histograms, so we can see
how long we block in the contended case. Our goal is to collect this
data across all of production at Google, so low overhead is critical.

I'm sending this RFC for feedback on the changes to rwsem.{h,c} and
lockdep.h in particular. For brevity, I'll describe the reasoning for
the down_write case only.

We want to measure the time lock acquisition takes. Naively, this is:

  u64 start = sched_clock();
  down_write(/* ... */);
  trace(sched_clock() - start);

My measurements show that this adds ~5-6% overhead to building a kernel
on a test machine [1]. This level of overhead is unacceptably high.

My measurements also show that instrumenting only the contended case
lowers the overhead to < 1%. Naively, we can instrument only the
contended case like this:

  if (!down_write_trylock(/* ... */)) {
          /* Time and call down_write() as before. */
  }

However, in the case where the `_trylock` succeeds, we have lost the
lockdep annotations (e.g. around ordering) that `down_write` would
normally include. (Granted, we don't run with lockdep in production,
but debug builds do.)

Taking it as given that we need the lower overhead, and that we aren't
okay with losing the lockdep annotations, we rejected the following
alternatives to this patch:

- Making rwsem.c's __down_write and __down_write_trylock public, so
  mmap_lock.c could construct its own version of LOCK_CONTENDED with
  tracepoint calls.

- Having mmap_lock.c reach into rwsem.c's internals with "extern"
  forward declarations for these functions (and removing "static
  inline").

- Somehow adding the instrumentation directly to rwsem.c (either
  affecting all locks, or polluting it in some other way).

The remaining alternative, I think, is what this patch proposes: add
API surface to rwsem.h which allows callers to provide instrumentation
callbacks, which are invoked only in the contended case. (A rough
sketch of what I mean is appended at the end of this mail.)

[1]: For measuring the overhead of the instrumentation, I've been
timing a defconfig kernel build. The numbers above come from a KVM
instance with 4 CPUs + 32G RAM, running 5.8-rc1 with this patch applied
and a histogram trigger configured for the acquire_returned tracepoint.
My test script is simple:

  for (( i=0; i<5; ++i )); do
          make mrproper > /dev/null || exit 1
          make defconfig > /dev/null || exit 1
          sync || exit 1
          echo 3 > /proc/sys/vm/drop_caches || exit 1
          /usr/bin/time make -j5 > /dev/null
  done

The numbers I'm giving above are computed as:

  (avg of 5 runs with this hist trigger enabled)
  / (avg of 5 runs on plain 5.8-rc1)

Axel Rasmussen (1):
  mmap_lock: add tracepoints around mmap_lock acquisition

 include/linux/lockdep.h          |  47 ++++++
 include/linux/mmap_lock.h        |  27 ++-
 include/linux/rwsem.h            |  12 ++
 include/trace/events/mmap_lock.h |  76 +++++++++
 kernel/locking/rwsem.c           |  64 +++++++
 mm/Kconfig                       |  19 +++
 mm/Makefile                      |   1 +
 mm/mmap_lock.c                   | 281 +++++++++++++++++++++++++++++++
 8 files changed, 526 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 mm/mmap_lock.c

--
2.27.0.111.gc72c7da667-goog
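
P.S. For concreteness, here's a rough sketch of the callback API shape
I have in mind. The identifiers below are invented for illustration;
see the patch itself for the actual names and signatures:

  /* include/linux/rwsem.h (sketch): a down_write() variant that fires
   * caller-supplied callbacks only on the contended slow path. The
   * trylock fast path stays callback-free, and the lockdep annotations
   * remain inside rwsem.c as usual. */
  struct rwsem_acquire_cbs {
          void (*contended)(void *data);  /* slow path entered */
          void (*acquired)(void *data);   /* lock finally acquired */
  };
  void down_write_cbs(struct rw_semaphore *sem,
                      const struct rwsem_acquire_cbs *cbs, void *data);

  /* mm/mmap_lock.c (sketch): record sched_clock() when we start
   * blocking, and emit the tracepoint with the elapsed time once the
   * lock is held. Neither callback runs in the uncontended case, so
   * only contended acquisitions pay for the tracing. */
  static void mmap_lock_cb_contended(void *data)
  {
          *(u64 *)data = sched_clock();
  }

  static void mmap_lock_cb_acquired(void *data)
  {
          /* e.g. the acquire_returned tracepoint from this series */
          trace_mmap_lock_acquire_returned(sched_clock() - *(u64 *)data);
  }

The call site in the mmap_write_lock wrapper would then look something
like:

  u64 start;
  static const struct rwsem_acquire_cbs cbs = {
          .contended = mmap_lock_cb_contended,
          .acquired  = mmap_lock_cb_acquired,
  };
  down_write_cbs(&mm->mmap_lock, &cbs, &start);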
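
P.P.S. For the measurements in [1], the histogram trigger is along
these lines (illustrative only: the event path and the name of the
duration field depend on how the tracepoints in this series end up
being defined; I'm assuming an acquire_returned event in an mmap_lock
group with a duration_ns field):

  # log2 histogram of contended-acquisition latency
  echo 'hist:keys=duration_ns.log2' \
      > /sys/kernel/tracing/events/mmap_lock/acquire_returned/trigger
  cat /sys/kernel/tracing/events/mmap_lock/acquire_returned/hist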