Hi Daniel, > On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@xxxxxx> wrote: >> >> >> >>> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote: >>> >>> >>> >>>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote: >>>> >>>> >>>> >>>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: >>>>> >>>>> On 9/3/20 1:05 AM, Song Liu wrote: >>>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> Hi Song, >>>>>>> >>>>>>> Sorry indeed missed it. >>>>>>> >>>>>>> On 9/2/20 11:33 PM, Song Liu wrote: >>>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote: >>>>>>>>> >>>>>>>>> We are looking at sharing perf events amount multiple processes via >>>>>>>>> pinned perf event array. However, we found this doesn't really work >>>>>>>>> as the perf event is removed from the map when the struct file is >>>>>>>>> released from user space (in perf_event_fd_array_release). This >>>>>>>>> means, the pinned perf event array can be shared among multiple >>>>>>>>> process. But each perf event still has single owner. Once the owner >>>>>>>>> process closes the fd (or terminates), the perf event is removed >>>>>>>>> from the array. I went thought the history of the code and found >>>>>>>>> this behavior is actually expected (commit 3b1efb196eee). >>>>>>> >>>>>>> Right, that auto-cleanup is expected. >>>>>>> >>>>>>>>> In our use case, however, we want to share the perf event among >>>>>>>>> different processes. I think we have a few options to achieve this: >>>>>>>>> >>>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN. >>>>>>>>> Once this flag is set, we should not remove the fd on struct file >>>>>>>>> release. Instead, we remove fd in map_release_uref. >>>>>>>>> >>>>>>>>> 2. Allow a different process to hold reference to the perf_event >>>>>>>>> via pinned perf event array. I guess we can achieve this by >>>>>>>>> enabling for BPF_MAP_UPDATE_ELEM perf event array. >>>>>>>>> >>>>>>>>> 3. Other ideas? >>>>>>>>> >>>>>>>>> Could you please let us know your thoughts on this? >>>>>>> >>>>>>> One option that would work for sure is to transfer the fd to the other >>>>>>> processes via fd passing e.g. through pipe or unix domain socket. >>>>>> I haven't tried to transfer the fd, but it might be tricky. We need to >>>>>> plan for more than 2 processes sharing the events, and these processes >>>>>> will start and terminate in any order. >>>>>>> I guess my question would be that it would be hard to debug if we keep >>>>>>> dangling perf event entries in there yb accident that noone is cleaning >>>>>>> up. Some sort of flag is probably okay, but it needs proper introspection >>>>>>> facilities from bpftool side so that it could be detected that it's just >>>>>>> dangling around waiting for cleanup. >>>>>> With my latest design, we don't need to pin the perf_event map (neither >>>>>> the prog accessing the map. I guess this can make the clean up problem >>>>>> better? So we will add a new flag for map_create. With the flag, we >>>>> >>>>> I mean pinning the map itself or the prog making use of accessing the map >>>>> is not the issue. Afaik, it's more the perf RB that is consuming memory and >>>>> can be dangling, so the presence of the /entry/ in the map itself which >>>>> would then not be cleaned up by accident, I think this was the motivation >>>>> back then at least. > > Daniel, are you aware of any use cases that do rely on such a behavior > of PERV_EVENT_ARRAY? > > For me this auto-removal of elements on closing *one of a few* > PERF_EVENT_ARRAY FDs (original one, but still just one of a few active > ones) was extremely surprising. It doesn't follow what we do for any > other BPF map, as far as I can tell. E.g., think about > BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't > auto-remove all the tail call programs, right? There is exactly the > same concern with not auto-releasing bpf_progs, just like with > perf_event. But it's not accidental, if you are pinning a BPF map, you > know what you are doing (at least we have to assume so :). > > So instead of adding an extra option, shouldn't we just fix this > behavior instead and make it the same across all BPF maps that hold > kernel resources? Could you please share your thoughts on this? I personally don't have strong preference one way (add a flag) or the other (change the default behavior). But I think we need to agree on the direction to go. Thanks, Song [...]