On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@xxxxxx> wrote:
>
> > On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >
> >> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
> >>
> >>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>
> >>> On 9/3/20 1:05 AM, Song Liu wrote:
> >>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi Song,
> >>>>>
> >>>>> Sorry, indeed missed it.
> >>>>>
> >>>>> On 9/2/20 11:33 PM, Song Liu wrote:
> >>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >>>>>>>
> >>>>>>> We are looking at sharing perf events among multiple processes via
> >>>>>>> a pinned perf event array. However, we found this doesn't really
> >>>>>>> work, as the perf event is removed from the map when the struct
> >>>>>>> file is released from user space (in perf_event_fd_array_release).
> >>>>>>> This means the pinned perf event array can be shared among multiple
> >>>>>>> processes, but each perf event still has a single owner. Once the
> >>>>>>> owner process closes the fd (or terminates), the perf event is
> >>>>>>> removed from the array. I went through the history of the code and
> >>>>>>> found this behavior is actually expected (commit 3b1efb196eee).
> >>>>>
> >>>>> Right, that auto-cleanup is expected.
> >>>>>
> >>>>>>> In our use case, however, we want to share the perf event among
> >>>>>>> different processes. I think we have a few options to achieve this:
> >>>>>>>
> >>>>>>> 1. Introduce a new flag for the perf event map, like
> >>>>>>> BPF_F_KEEP_PE_OPEN. Once this flag is set, we should not remove the
> >>>>>>> fd on struct file release. Instead, we remove the fd in
> >>>>>>> map_release_uref.
> >>>>>>>
> >>>>>>> 2. Allow a different process to hold a reference to the perf_event
> >>>>>>> via the pinned perf event array. I guess we can achieve this by
> >>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
> >>>>>>>
> >>>>>>> 3. Other ideas?
> >>>>>>>
> >>>>>>> Could you please let us know your thoughts on this?
> >>>>>
> >>>>> One option that would work for sure is to transfer the fd to the
> >>>>> other processes via fd passing, e.g. through a pipe or unix domain
> >>>>> socket.
> >>>>
> >>>> I haven't tried to transfer the fd, but it might be tricky. We need to
> >>>> plan for more than 2 processes sharing the events, and these processes
> >>>> will start and terminate in any order.
> >>>>
> >>>>> I guess my question would be that it would be hard to debug if we
> >>>>> keep dangling perf event entries in there by accident that no one is
> >>>>> cleaning up. Some sort of flag is probably okay, but it needs proper
> >>>>> introspection facilities from the bpftool side so that it can be
> >>>>> detected that an entry is just dangling around waiting for cleanup.
> >>>>
> >>>> With my latest design, we don't need to pin the perf_event map (nor
> >>>> the prog accessing the map). I guess this makes the cleanup problem
> >>>> better? So we will add a new flag for map_create. With the flag, we
> >>>
> >>> I mean pinning the map itself or the prog accessing the map is not the
> >>> issue. Afaik, it's more the perf RB that is consuming memory and can
> >>> be dangling; it's the presence of the /entry/ in the map itself which
> >>> would then not be cleaned up by accident. I think this was the
> >>> motivation back then, at least.

Daniel, are you aware of any use cases that do rely on such behavior of
PERF_EVENT_ARRAY?
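To make the problem concrete, here is roughly the repro I have in mind (an
untested sketch using raw syscalls, error handling omitted; the pin path is
just illustrative). Process A creates a PERF_EVENT_ARRAY, installs a perf
event into slot 0, pins the map in bpffs, and exits; the pinned map
survives, but slot 0 is silently emptied:

#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void)
{
	struct perf_event_attr pe = {
		.type = PERF_TYPE_SOFTWARE,
		.size = sizeof(pe),
		.config = PERF_COUNT_SW_CPU_CLOCK,
	};
	union bpf_attr attr;
	int map_fd, pe_fd, key = 0;

	/* Create a one-slot BPF_MAP_TYPE_PERF_EVENT_ARRAY. */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
	attr.key_size = sizeof(int);
	attr.value_size = sizeof(int);
	attr.max_entries = 1;
	map_fd = sys_bpf(BPF_MAP_CREATE, &attr);

	/* Open a perf event and install it in slot 0. */
	pe_fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&key;
	attr.value = (__u64)(unsigned long)&pe_fd;
	sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);

	/* Pin the map so it outlives this process... */
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/pe_array";
	attr.bpf_fd = map_fd;
	sys_bpf(BPF_OBJ_PIN, &attr);

	/* ...but on exit, releasing map_fd's struct file triggers
	 * perf_event_fd_array_release(), which purges every entry that
	 * was installed through that file. The pinned map survives;
	 * slot 0 does not.
	 */
	return 0;
}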
For me this auto-removal of elements on closing *one of a few*
PERF_EVENT_ARRAY FDs (the original one, but still just one of a few active
ones) was extremely surprising. It doesn't follow what we do for any other
BPF map, as far as I can tell. E.g., think about BPF_MAP_TYPE_PROG_ARRAY:
if we pin it in BPF FS and close the FD, it won't auto-remove all the tail
call programs, right? Exactly the same concern about not auto-releasing
bpf_progs exists there as exists here for perf_events. But it's not
accidental: if you are pinning a BPF map, you know what you are doing (at
least we have to assume so :). So instead of adding an extra option,
shouldn't we just fix this behavior and make it consistent across all BPF
maps that hold kernel resources?

> >>>
> >>>> will not close the perf_event during process termination, and we
> >>>> block pinning for this map, and any program accessing this map. Does
> >>>> this sound like a good plan?
> >>>
> >>> Could you elaborate on why blocking pinning of the map/prog is useful
> >>> in this context?
> >>
> >> I was thinking we are more likely to forget to clean up a pinned map.
> >> If the map is not pinned, it will at least be cleaned up when all
> >> processes accessing it terminate. On the other hand, a pinned map
> >> would stay until someone removes it from bpffs. So the idea is to
> >> avoid the pinning scenario.
> >>
> >> But I agree this won't solve all the problems. Say process A added a
> >> few perf events (A_events) to the map, and process B added some other
> >> events (B_events) to the same map. Blocking pinning makes sure we
> >> clean up everything when both A and B terminate. But if A terminates
> >> soon, while B runs for a long time, A_events will stay open
> >> unnecessarily.
> >>
> >> Alternatively, we can implement map_fd_sys_lookup_elem for the perf
> >> event map, which returns an fd to the perf event. With this solution,
> >> if process A added some events to the map, and process B wants to use
> >> them after process A terminates, process B needs to explicitly do the
> >> lookup and hold the fd to the perf event. Maybe this is a better
> >> solution?
> >
> > Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can
> > get an fd to the perf_event, but we still remove the event from the
> > map in perf_event_fd_array_release(). Let me see what the best next
> > step is...
> >
> > CC Andrii and bpf@

Thanks, Song!

> Andrii and I had some discussion on this.
>
> Currently, I am working on something with a new flag, BPF_F_SHARE_PE. I
> attached the diff below.
>
> On the other hand, we found the current behavior of perf_event_array
> puzzling, especially for a pinned perf_event_array (as pinning doesn't
> really pin the content). Therefore, we may consider changing the
> behavior without a flag?
>
> Thanks,
> Song
>
> ========================================================================
> [...]
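One more note on Daniel's fd-passing suggestion upthread, for completeness.
Mechanically it is just SCM_RIGHTS over a unix domain socket, roughly like
the sketch below (untested, error handling omitted; socket setup between
the processes is assumed). As far as I can tell, passing around the
original map FD keeps its struct file alive, which is what stops
perf_event_fd_array_release() from purging the entries:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an fd (e.g. the original PERF_EVENT_ARRAY map fd) to another
 * process over a connected unix domain socket using SCM_RIGHTS.
 */
static int send_fd(int sock, int fd)
{
	char cbuf[CMSG_SPACE(sizeof(int))] = {};
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the fd on the other side; the kernel installs a fresh fd
 * referring to the same struct file.
 */
static int recv_fd(int sock)
{
	char cbuf[CMSG_SPACE(sizeof(int))] = {};
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg;
	int fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
	    cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

The mechanics are simple enough; the hard part is exactly what Song
already pointed out: orchestrating this across N processes that start and
terminate in arbitrary order. Which is why a map-level solution still
looks more attractive to me.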