The overall goal of this patch is to add tracepoints around mmap_lock
acquisition. This will let us collect latency histograms, so we can see
how long we block in the contended case. Our goal is to collect this
data across all of production at Google, so low overhead is critical.

I'm sending this RFC for feedback on the changes to rwsem.{h,c} and
lockdep.h in particular. For brevity, I'll describe the reasoning for
the down_write case only.

We want to measure the time lock acquisition takes. Naively, this is:

  u64 start = sched_clock();
  down_write(/* ... */);
  trace(sched_clock() - start);

My measurements show that this adds ~5-6% overhead to building a kernel
on a test machine [1]. This level of overhead is unacceptably high.

My measurements also show that instrumenting only the contended case
lowers the overhead to < 1%. Naively, we can instrument only the
contended case like this:

  if (!down_write_trylock(/* ... */)) {
          /* Time and call down_write() as before. */
  }

However, in the case where the `_trylock` succeeds, we have lost the
lockdep annotations (e.g. around ordering) that `down_write` would
normally include. (Granted, we don't run with lockdep in production,
but debug builds do.)

Taking it as given that we need the lower overhead, and that we aren't
okay with losing the lockdep annotations, we rejected the following
alternatives to this patch:

- Making rwsem.c's __down_write and __down_write_trylock public, so
  mmap_lock.c could construct its own version of LOCK_CONTENDED with
  tracepoint calls.

- Having mmap_lock.c reach into rwsem.c's internals with "extern"
  forward declarations for these functions (and removing "static
  inline").

- Somehow adding the instrumentation directly to rwsem.c (either
  affecting all locks, or polluting it in some other way).

The remaining alternative, I think, is what this patch proposes: add
API surface to rwsem.h which allows callers to provide instrumentation
callbacks, which are invoked only in the contended case. (A rough
sketch of what I mean is appended at the end of this mail.)

[1]: For measuring the overhead of the instrumentation, I've been
timing a defconfig kernel build. The numbers above come from a KVM
instance with 4 CPUs + 32G RAM, running 5.8-rc1 with this patch applied
and a histogram trigger configured for the acquire_returned tracepoint.
My test script is simple:

  for (( i=0; i<5; ++i )); do
          make mrproper > /dev/null || exit 1
          make defconfig > /dev/null || exit 1
          sync || exit 1
          echo 3 > /proc/sys/vm/drop_caches || exit 1
          /usr/bin/time make -j5 > /dev/null
  done

The numbers I'm giving above are computed as:

  (avg of 5 runs with this hist trigger enabled)
  / (avg of 5 runs on plain 5.8-rc1)

Axel Rasmussen (1):
  mmap_lock: add tracepoints around mmap_lock acquisition

 include/linux/lockdep.h          |  47 ++++++
 include/linux/mmap_lock.h        |  27 ++-
 include/linux/rwsem.h            |  12 ++
 include/trace/events/mmap_lock.h |  76 +++++++++
 kernel/locking/rwsem.c           |  64 +++++++
 mm/Kconfig                       |  19 +++
 mm/Makefile                      |   1 +
 mm/mmap_lock.c                   | 281 +++++++++++++++++++++++++++++++
 8 files changed, 526 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 mm/mmap_lock.c

--
2.27.0.111.gc72c7da667-goog
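
P.S. For concreteness, here's a rough sketch of the callback API shape
I have in mind. The identifiers below are invented for illustration;
see the patch itself for the actual names and signatures:

  /* include/linux/rwsem.h (sketch): a down_write() variant that fires
   * caller-supplied callbacks only on the contended slow path. The
   * trylock fast path stays callback-free, and the lockdep annotations
   * remain inside rwsem.c as usual. */
  struct rwsem_acquire_cbs {
          void (*contended)(void *data);  /* slow path entered */
          void (*acquired)(void *data);   /* lock finally acquired */
  };
  void down_write_cbs(struct rw_semaphore *sem,
                      const struct rwsem_acquire_cbs *cbs, void *data);

  /* mm/mmap_lock.c (sketch): record sched_clock() when we start
   * blocking, and emit the tracepoint with the elapsed time once the
   * lock is held. Neither callback runs in the uncontended case, so
   * only contended acquisitions pay for the tracing. */
  static void mmap_lock_cb_contended(void *data)
  {
          *(u64 *)data = sched_clock();
  }

  static void mmap_lock_cb_acquired(void *data)
  {
          /* e.g. the acquire_returned tracepoint from this series */
          trace_mmap_lock_acquire_returned(sched_clock() - *(u64 *)data);
  }

The call site in the mmap_write_lock wrapper would then look something
like:

  u64 start;
  static const struct rwsem_acquire_cbs cbs = {
          .contended = mmap_lock_cb_contended,
          .acquired  = mmap_lock_cb_acquired,
  };
  down_write_cbs(&mm->mmap_lock, &cbs, &start);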
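
P.P.S. For the measurements in [1], the histogram trigger is along
these lines (illustrative only: the event path and the name of the
duration field depend on how the tracepoints in this series end up
being defined; I'm assuming an acquire_returned event in an mmap_lock
group with a duration_ns field):

  # log2 histogram of contended-acquisition latency
  echo 'hist:keys=duration_ns.log2' \
      > /sys/kernel/tracing/events/mmap_lock/acquire_returned/trigger
  cat /sys/kernel/tracing/events/mmap_lock/acquire_returned/hist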