Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 23 May 2019 19:08:51 -0700

On Thu, May 23, 2019 at 09:57:37PM -0400, Steven Rostedt wrote:
> On Thu, 23 May 2019 17:31:50 -0700
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> 
> 
> > > Now from what I'm reading, it seams that the Dtrace layer may be
> > > abstracting out fields from the kernel. This is actually something I
> > > have been thinking about to solve the "tracepoint abi" issue. There's
> > > usually basic ideas that happen. An interrupt goes off, there's a
> > > handler, etc. We could abstract that out that we trace when an
> > > interrupt goes off and the handler happens, and record the vector
> > > number, and/or what device it was for. We have tracepoints in the
> > > kernel that do this, but they do depend a bit on the implementation.
> > > Now, if we could get a layer that abstracts this information away from
> > > the implementation, then I think that's a *good* thing.  
> > 
> > I don't like this deferred irq idea at all.
> 
> What do you mean deferred?

that's how I interpreted your proposal: 
"interrupt goes off and the handler happens, and record the vector number"
It's not a good thing to tell about irq later.
Just like saying lets record perf counter event and report it later.

> > Abstracting details from the users is _never_ a good idea.
> 
> Really? Most everything we do is to abstract details from the user. The
> key is to make the abstraction more meaningful than the raw data.
> 
> > A ton of people use bcc scripts and bpftrace because they want those details.
> > They need to know what kernel is doing to make better decisions.
> > Delaying irq record is the opposite.
> 
> I never said anything about delaying the record. Just getting the
> information that is needed.
> 
> > > 
> > > I wish that was totally true, but tracepoints *can* be an abi. I had
> > > code reverted because powertop required one to be a specific
> > > format. To this day, the wakeup event has a "success" field that
> > > writes in a hardcoded "1", because there's tools that depend on it,
> > > and they only work if there's a success field and the value is 1.  
> > 
> > I really think that you should put powertop nightmares to rest.
> > That was long ago. The kernel is different now.
> 
> Is it?
> 
> > Linus made it clear several times that it is ok to change _all_
> > tracepoints. Period. Some maintainers somehow still don't believe
> > that they can do it.
> 
> From what I remember him saying several times, is that you can change
> all tracepoints, but if it breaks a tool that is useful, then that
> change will get reverted. He will allow you to go and fix that tool and
> bring back the change (which was the solution to powertop).

my interpretation is different.
We changed tracepoints. It broke scripts. People changed scripts.

> 
> > 
> > Some tracepoints are used more than others and more people will
> > complain: "ohh I need to change my script" when that tracepoint
> > changes. But the kernel development is not going to be hampered by a
> > tracepoint. No matter how widespread its usage in scripts.
> 
> That's because we'll treat bpf (and Dtrace) scripts like modules (no
> abi), at least we better. But if there's a tool that doesn't use the
> script and reads the tracepoint directly via perf, then that's a
> different story.

absolutely not.
tracepoint is a tracepoint. It can change regardless of what
and how is using it.