On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@xxxxxx> wrote:
>
> > On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >
> >> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
> >>
> >>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>
> >>> On 9/3/20 1:05 AM, Song Liu wrote:
> >>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi Song,
> >>>>>
> >>>>> Sorry, indeed missed it.
> >>>>>
> >>>>> On 9/2/20 11:33 PM, Song Liu wrote:
> >>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote:
> >>>>>>>
> >>>>>>> We are looking at sharing perf events among multiple processes via
> >>>>>>> a pinned perf event array. However, we found this doesn't really
> >>>>>>> work, as the perf event is removed from the map when the struct
> >>>>>>> file is released from user space (in perf_event_fd_array_release).
> >>>>>>> This means the pinned perf event array can be shared among multiple
> >>>>>>> processes, but each perf event still has a single owner. Once the
> >>>>>>> owner process closes the fd (or terminates), the perf event is
> >>>>>>> removed from the array. I went through the history of the code and
> >>>>>>> found this behavior is actually expected (commit 3b1efb196eee).
> >>>>>
> >>>>> Right, that auto-cleanup is expected.
> >>>>>
> >>>>>>> In our use case, however, we want to share the perf event among
> >>>>>>> different processes. I think we have a few options to achieve this:
> >>>>>>>
> >>>>>>> 1. Introduce a new flag for the perf event map, like
> >>>>>>> BPF_F_KEEP_PE_OPEN. Once this flag is set, we should not remove the
> >>>>>>> fd on struct file release. Instead, we remove the fd in
> >>>>>>> map_release_uref.
> >>>>>>>
> >>>>>>> 2. Allow a different process to hold a reference to the perf_event
> >>>>>>> via the pinned perf event array. I guess we can achieve this by
> >>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
> >>>>>>>
> >>>>>>> 3. Other ideas?
> >>>>>>>
> >>>>>>> Could you please let us know your thoughts on this?
> >>>>>
> >>>>> One option that would work for sure is to transfer the fd to the
> >>>>> other processes via fd passing, e.g. through a pipe or unix domain
> >>>>> socket.
> >>>>
> >>>> I haven't tried to transfer the fd, but it might be tricky. We need to
> >>>> plan for more than 2 processes sharing the events, and these processes
> >>>> will start and terminate in any order.
> >>>>
> >>>>> I guess my question would be that it would be hard to debug if we
> >>>>> keep dangling perf event entries in there by accident that no one is
> >>>>> cleaning up. Some sort of flag is probably okay, but it needs proper
> >>>>> introspection facilities from the bpftool side so that it can be
> >>>>> detected that an entry is just dangling around waiting for cleanup.
> >>>>
> >>>> With my latest design, we don't need to pin the perf_event map (nor
> >>>> the prog accessing the map). I guess this makes the cleanup problem
> >>>> better? So we will add a new flag for map_create. With the flag, we
> >>>
> >>> I mean pinning the map itself or the prog accessing the map is not the
> >>> issue. Afaik, it's more the perf RB that is consuming memory and can
> >>> be dangling; it's the presence of the /entry/ in the map itself which
> >>> would then not be cleaned up by accident. I think this was the
> >>> motivation back then, at least.

Daniel, are you aware of any use cases that do rely on such behavior of
PERF_EVENT_ARRAY?
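To make the problem concrete, here is roughly the repro I have in mind (an
untested sketch using raw syscalls, error handling omitted; the pin path is
just illustrative). Process A creates a PERF_EVENT_ARRAY, installs a perf
event into slot 0, pins the map in bpffs, and exits; the pinned map
survives, but slot 0 is silently emptied:

#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void)
{
	struct perf_event_attr pe = {
		.type = PERF_TYPE_SOFTWARE,
		.size = sizeof(pe),
		.config = PERF_COUNT_SW_CPU_CLOCK,
	};
	union bpf_attr attr;
	int map_fd, pe_fd, key = 0;

	/* Create a one-slot BPF_MAP_TYPE_PERF_EVENT_ARRAY. */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
	attr.key_size = sizeof(int);
	attr.value_size = sizeof(int);
	attr.max_entries = 1;
	map_fd = sys_bpf(BPF_MAP_CREATE, &attr);

	/* Open a perf event and install it in slot 0. */
	pe_fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&key;
	attr.value = (__u64)(unsigned long)&pe_fd;
	sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);

	/* Pin the map so it outlives this process... */
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/pe_array";
	attr.bpf_fd = map_fd;
	sys_bpf(BPF_OBJ_PIN, &attr);

	/* ...but on exit, releasing map_fd's struct file triggers
	 * perf_event_fd_array_release(), which purges every entry that
	 * was installed through that file. The pinned map survives;
	 * slot 0 does not.
	 */
	return 0;
}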
For me this auto-removal of elements on closing *one of a few*
PERF_EVENT_ARRAY FDs (the original one, but still just one of a few active
ones) was extremely surprising. It doesn't follow what we do for any other
BPF map, as far as I can tell. E.g., think about BPF_MAP_TYPE_PROG_ARRAY:
if we pin it in BPF FS and close the FD, it won't auto-remove all the tail
call programs, right? Exactly the same concern about not auto-releasing
bpf_progs exists there as exists here for perf_events. But it's not
accidental: if you are pinning a BPF map, you know what you are doing (at
least we have to assume so :). So instead of adding an extra option,
shouldn't we just fix this behavior and make it consistent across all BPF
maps that hold kernel resources?

> >>>
> >>>> will not close the perf_event during process termination, and we
> >>>> block pinning for this map, and any program accessing this map. Does
> >>>> this sound like a good plan?
> >>>
> >>> Could you elaborate on why blocking pinning of the map/prog is useful
> >>> in this context?
> >>
> >> I was thinking we are more likely to forget to clean up a pinned map.
> >> If the map is not pinned, it will at least be cleaned up when all
> >> processes accessing it terminate. On the other hand, a pinned map
> >> would stay until someone removes it from bpffs. So the idea is to
> >> avoid the pinning scenario.
> >>
> >> But I agree this won't solve all the problems. Say process A added a
> >> few perf events (A_events) to the map, and process B added some other
> >> events (B_events) to the same map. Blocking pinning makes sure we
> >> clean up everything when both A and B terminate. But if A terminates
> >> soon, while B runs for a long time, A_events will stay open
> >> unnecessarily.
> >>
> >> Alternatively, we can implement map_fd_sys_lookup_elem for the perf
> >> event map, which returns an fd to the perf event. With this solution,
> >> if process A added some events to the map, and process B wants to use
> >> them after process A terminates, process B needs to explicitly do the
> >> lookup and hold the fd to the perf event. Maybe this is a better
> >> solution?
> >
> > Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can
> > get an fd to the perf_event, but we still remove the event from the
> > map in perf_event_fd_array_release(). Let me see what the best next
> > step is...
> >
> > CC Andrii and bpf@

Thanks, Song!

> Andrii and I had some discussion on this.
>
> Currently, I am working on something with a new flag, BPF_F_SHARE_PE. I
> attached the diff below.
>
> On the other hand, we found the current behavior of perf_event_array
> puzzling, especially for a pinned perf_event_array (as pinning doesn't
> really pin the content). Therefore, we may consider changing the
> behavior without a flag?
>
> Thanks,
> Song
>
> ========================================================================
> [...]
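One more note on Daniel's fd-passing suggestion upthread, for completeness.
Mechanically it is just SCM_RIGHTS over a unix domain socket, roughly like
the sketch below (untested, error handling omitted; socket setup between
the processes is assumed). As far as I can tell, passing around the
original map FD keeps its struct file alive, which is what stops
perf_event_fd_array_release() from purging the entries:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an fd (e.g. the original PERF_EVENT_ARRAY map fd) to another
 * process over a connected unix domain socket using SCM_RIGHTS.
 */
static int send_fd(int sock, int fd)
{
	char cbuf[CMSG_SPACE(sizeof(int))] = {};
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the fd on the other side; the kernel installs a fresh fd
 * referring to the same struct file.
 */
static int recv_fd(int sock)
{
	char cbuf[CMSG_SPACE(sizeof(int))] = {};
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg;
	int fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
	    cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

The mechanics are simple enough; the hard part is exactly what Song
already pointed out: orchestrating this across N processes that start and
terminate in arbitrary order. Which is why a map-level solution still
looks more attractive to me.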