Re: Tracing NVMe Driver with BPF missing events

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 2 Jun 2022 20:01:16 -0700

On Thu, Jun 2, 2022 at 6:52 PM John Mazzie <john.p.mazzie@xxxxxxxxx> wrote:
>
> Thanks for the help. fentry/fexit seem to be working to get every
> event when tracing both events.
>
> I do have one question about how fentry/fexit work in regards to the
> function parameters. fexit can access the function parameters in
> addition to the return value. Let's say the parameters are pointers
> whose value changes between entry and exit on the probed function. Are
> the parameters being accessed on entry or exit in fexit. My assumption
> would be exit, so I could access the modified values. Is that correct?
> The data I'm pulling appears like it might just be the entry
> (non-configured) values.

For fexit programs values of registers corresponding to input
arguments is stored before the function call. If function updates
those register it won't be reflected.

>
> On Tue, May 24, 2022 at 6:40 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Tue, May 24, 2022 at 9:12 AM John Mazzie <john.p.mazzie@xxxxxxxxx> wrote:
> > >
> > > After thinking about this more, maybe it's more to do with the
> > > interrupt handler for nvme_complete_rq.
> > >
> > > In this situation, sometimes the handler for nvme_setup_cmd could be
> > > interrupted to handle nvme_complete_rq and in this situation the
> > > nvme_complete_rq handler wouldn't run. because the nvme_setup_cmd
> > > handler is not complete.
> > >
> > > Is this understanding correct?
> >
> > Yes.
> >
> > > I'm assuming there is no real
> > > workaround for this situation, so we may just need to accept some
> > > missed events.
> >
> > Try using fentry/fexit programs instead. They use different reentrancy
> > protection which is at a per-program level.
> >
> > >
> > > John
> > >
> > > On Sat, May 21, 2022 at 11:52 AM John Mazzie <john.p.mazzie@xxxxxxxxx> wrote:
> > > >
> > > > In this case is a BPF program the individual handler of a tracepoint,
> > > > or in my context, a BPF program my compiled program that traces both
> > > > tracepoints? We aren't running any other BPF tracing during these
> > > > tests besides my program counting these 2 tracepoints.
> > > >
> > > > In my program I have 2 handlers, one for
> > > > tracepoint:nvme:nvme_setup_cmd and another for
> > > > tracepoint:nvme:nvme_complete_rq. I've created a PERCPU_HASH map for
> > > > each handler (unique map for each) to use that keeps track of each
> > > > time the handler is invoked. The only thing that handler is doing in
> > > > each case is incrementing the count value in the map. Though I do
> > > > filter by device on each tracepoint. If I comment out the
> > > > nvme_setup_cmd code the nvme_complete_rq does get the correct count.
> > > >
> > > > The user side of my program just prints the values for each of these
> > > > maps on a 10 second increment.
> > > >
> > > > I can share my code to make this easier, is it preferred that I upload
> > > > my code to github and share the link in this thread?
> > > >
> > > > I agree that your suggestion could be my issue, but I just want to
> > > > make sure we're on the same page since I'm less familiar with the
> > > > internals of BPF.
> > > >
> > > > Thanks,
> > > > John
> > > >
> > > > On Fri, May 20, 2022 at 7:10 PM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, May 18, 2022 at 2:35 PM John Mazzie <john.p.mazzie@xxxxxxxxx> wrote:
> > > > > >
> > > > > > My group at Micron is using BPF and love the tracing capabilities it
> > > > > > provides. We are mainly focused on the storage subsystem and BPF has
> > > > > > been really helpful in understanding how the storage subsystem
> > > > > > interacts with our drives while running applications.
> > > > > >
> > > > > > In the process of developing a tool using BPF to trace the nvme
> > > > > > driver, we ran into an issue with some missing events. I wanted to
> > > > > > check to see if this is possibly a bug/limitation that I'm hitting or
> > > > > > if it's expected behavior with heavy tracing. We are trying to trace 2
> > > > > > trace points (nvme_setup_cmd and nvme_complete_rq) around 1M times a
> > > > > > second.
> > > > > > We noticed if we just trace one of the two, we see all the expected
> > > > > > events, but if we trace both at the same time, the nvme_complete_rq
> > > > >
> > > > > kprobe programs have per-CPU reentrancy protection. That is, if some
> > > > > BPF kprobe/tracepoint program is running and something happens (e.g.,
> > > > > BPF program calls some kernel function that has another BPF program
> > > > > attached to it, or preemption happens and another BPF program is
> > > > > supposed to run) that would trigger another BPF program, then that
> > > > > nested BPF program invocation will be skipped.
> > > > >
> > > > > This might be what happens in your case.
> > > > >
> > > > > > misses events. I am using two different percpu_hash maps to count both
> > > > > > events. One for setup and another for complete. My expectation was
> > > > > > that tracing these events would affect performance, somewhat, but not
> > > > > > miss events. Ultimately the tool would be used to trace nvme latencies
> > > > > > at the driver level by device and process.
> > > > > >
> > > > > > My tool was developed using libbpf v0.7, and I've tested on Rocky
> > > > > > Linux 8.5 (Kernel 4.18.0), Ubuntu 20.04 (Kernel 5.4) and Fedora 36
> > > > > > (Kernel 5.17.6) with the same results.
> > > > > >
> > > > > > Thanks,
> > > > > > John Mazzie
> > > > > > Principal Storage Solutions Engineer
> > > > > > Micron Technology, Inc.