On Fri, May 29, 2020 at 8:04 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Fri, 29 May 2020 10:09:57 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Thu, May 28, 2020 at 06:39:08PM -0700, Axel Rasmussen wrote:
> >
> > > The use case we have in mind for this is to enable this instrumentation
> > > widely in Google's production fleet. Internally, we have a userspace thing
> > > which scrapes these metrics and publishes them such that we can look at
> > > aggregate metrics across our fleet. The thinking is that mechanisms like
> > > lockdep or getting histograms with e.g. BPF attached to the tracepoint
> > > introduces too much overhead for this to be viable. (Although, granted, I
> > > don't have benchmarks to prove this - if there's skepticism, I can produce
> > > such a thing - or prove myself wrong and rethink my approach. :) )
> >
> > Whichever way around; I don't believe in special instrumentation like
> > this. We'll grow a thousand separate pieces of crap if we go this route.
> >
> > Next on, someone will come and instrument yet another lock, with yet
> > another 1000 lines of gunk.
> >
> > Why can't you kprobe the mmap_lock things and use ftrace histograms?
>
> +1.
> As far as I can see the series, if you want to make a histogram
> of the duration of acquiring locks, you might only need 7/7 (but this
> is a minimum subset.) I recommend you to introduce a set of tracepoints
> -- start-locking, locked, and released so that we can investigate
> which process is waiting for which one. Then you can use either bpf
> or ftrace to make a histogram easily.
>
> Thank you,
>
> --
> Masami Hiramatsu <mhiramat@xxxxxxxxxx>

The reasoning against using BPF or ftrace basically comes down to overhead.
My intuition is that BPF/ftrace are great for testing and debugging on a
small number of machines, but less suitable for leaving enabled in
production across many servers. That may not be true in general, but given
how "hot" this lock is, I think this may be something of a pathological
case.

Consider changes like maple trees or range locks: if we're running Linux on
many servers, with many different workloads, it's useful to see the impact
of those changes in production, in aggregate, over a "long" period of time,
rather than only under test conditions on a small number of machines.

I'll circle back next week with some benchmarks to confirm or refute my
intuition on this. If I can confirm the overhead of BPF / ftrace is low
enough, I'll pursue that route instead.

The point about special instrumentation is well taken. I completely agree
we don't want a file in /proc for each lock in the kernel. :) I do think
there's some argument that mmap_lock in particular is "special", considering
the amount of investment going into optimizing it compared to other locks.
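
For concreteness, the kind of tracepoint set Masami describes (start-locking
/ acquired / released) could look roughly like the sketch below. The event
and field names are illustrative placeholders, not something taken from the
posted series, and a real header would still need to be hooked into the
mmap_lock accessors:

/* Hypothetical include/trace/events/mmap_lock.h-style sketch. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mmap_lock

#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MMAP_LOCK_H

#include <linux/tracepoint.h>

struct mm_struct;

/* All three events share the same payload: the mm and read vs. write. */
DECLARE_EVENT_CLASS(mmap_lock_class,

	TP_PROTO(struct mm_struct *mm, bool write),

	TP_ARGS(mm, write),

	TP_STRUCT__entry(
		__field(struct mm_struct *, mm)
		__field(bool, write)
	),

	TP_fast_assign(
		__entry->mm = mm;
		__entry->write = write;
	),

	TP_printk("mm=%p write=%s",
		  __entry->mm, __entry->write ? "true" : "false")
);

/* Emitted just before attempting to take the lock. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_start_locking,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

/* Emitted once the lock has actually been acquired. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_acquire_returned,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

/* Emitted when the lock is released. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_released,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

#endif /* _TRACE_MMAP_LOCK_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

With tracepoints like these, the latency histogram could be built from
userspace with an ftrace hist trigger or a small BPF program keyed on the
start/acquired pair, which is roughly the comparison I intend to benchmark
against the dedicated counters.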