On Fri, May 29, 2020 at 8:04 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Fri, 29 May 2020 10:09:57 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Thu, May 28, 2020 at 06:39:08PM -0700, Axel Rasmussen wrote:
> >
> > > The use case we have in mind for this is to enable this instrumentation
> > > widely in Google's production fleet. Internally, we have a userspace thing
> > > which scrapes these metrics and publishes them such that we can look at
> > > aggregate metrics across our fleet. The thinking is that mechanisms like
> > > lockdep or getting histograms with e.g. BPF attached to the tracepoint
> > > introduces too much overhead for this to be viable. (Although, granted, I
> > > don't have benchmarks to prove this - if there's skepticism, I can produce
> > > such a thing - or prove myself wrong and rethink my approach. :) )
> >
> > Whichever way around; I don't believe in special instrumentation like
> > this. We'll grow a thousand separate pieces of crap if we go this route.
> >
> > Next on, someone will come and instrument yet another lock, with yet
> > another 1000 lines of gunk.
> >
> > Why can't you kprobe the mmap_lock things and use ftrace histograms?
>
> +1.
> As far as I can see the series, if you want to make a histogram
> of the duration of acquiring locks, you might only need 7/7 (but this
> is a minimum subset.) I recommend you to introduce a set of tracepoints
> -- start-locking, locked, and released so that we can investigate
> which process is waiting for which one. Then you can use either bpf
> or ftrace to make a histogram easily.
>
> Thank you,
>
> --
> Masami Hiramatsu <mhiramat@xxxxxxxxxx>

The reasoning against using BPF or ftrace basically comes down to overhead.
My intuition is that BPF/ftrace are great for testing and debugging on a
small number of machines, but less suitable for leaving enabled in
production across many servers. That may not be true in general, but given
how "hot" this lock is, I think this may be something of a pathological
case.

Consider changes like maple trees or range locks: if we're running Linux on
many servers, with many different workloads, it's useful to see the impact
of those changes in production, in aggregate, over a "long" period of time,
rather than only under test conditions on a small number of machines.

I'll circle back next week with some benchmarks to confirm or refute my
intuition on this. If I can confirm the overhead of BPF / ftrace is low
enough, I'll pursue that route instead.

The point about special instrumentation is well taken. I completely agree
we don't want a file in /proc for each lock in the kernel. :) I do think
there's some argument that mmap_lock in particular is "special", considering
the amount of investment going into optimizing it compared to other locks.
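
For concreteness, the kind of tracepoint set Masami describes (start-locking
/ acquired / released) could look roughly like the sketch below. The event
and field names are illustrative placeholders, not something taken from the
posted series, and a real header would still need to be hooked into the
mmap_lock accessors:

/* Hypothetical include/trace/events/mmap_lock.h-style sketch. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mmap_lock

#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MMAP_LOCK_H

#include <linux/tracepoint.h>

struct mm_struct;

/* All three events share the same payload: the mm and read vs. write. */
DECLARE_EVENT_CLASS(mmap_lock_class,

	TP_PROTO(struct mm_struct *mm, bool write),

	TP_ARGS(mm, write),

	TP_STRUCT__entry(
		__field(struct mm_struct *, mm)
		__field(bool, write)
	),

	TP_fast_assign(
		__entry->mm = mm;
		__entry->write = write;
	),

	TP_printk("mm=%p write=%s",
		  __entry->mm, __entry->write ? "true" : "false")
);

/* Emitted just before attempting to take the lock. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_start_locking,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

/* Emitted once the lock has actually been acquired. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_acquire_returned,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

/* Emitted when the lock is released. */
DEFINE_EVENT(mmap_lock_class, mmap_lock_released,
	TP_PROTO(struct mm_struct *mm, bool write),
	TP_ARGS(mm, write)
);

#endif /* _TRACE_MMAP_LOCK_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

With tracepoints like these, the latency histogram could be built from
userspace with an ftrace hist trigger or a small BPF program keyed on the
start/acquired pair, which is roughly the comparison I intend to benchmark
against the dedicated counters.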