Re: Behavior of pinned perf event array

Song Liu <songliubraving@xxxxxx> · Wed, 23 Sep 2020 16:21:50 +0000

Hi Daniel,

> On Sep 14, 2020, at 3:59 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> 
> On Fri, Sep 11, 2020 at 1:36 PM Song Liu <songliubraving@xxxxxx> wrote:
>> 
>> 
>> 
>>> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>> 
>>> 
>>> 
>>>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On 9/3/20 1:05 AM, Song Liu wrote:
>>>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Hi Song,
>>>>>>> 
>>>>>>> Sorry indeed missed it.
>>>>>>> 
>>>>>>> On 9/2/20 11:33 PM, Song Liu wrote:
>>>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> We are looking at sharing perf events amount multiple processes via
>>>>>>>>> pinned perf event array. However, we found this doesn't really work
>>>>>>>>> as the perf event is removed from the map when the struct file is
>>>>>>>>> released from user space (in perf_event_fd_array_release). This
>>>>>>>>> means, the pinned perf event array can be shared among multiple
>>>>>>>>> process. But each perf event still has single owner. Once the owner
>>>>>>>>> process closes the fd (or terminates), the perf event is removed
>>>>>>>>> from the array. I went thought the history of the code and found
>>>>>>>>> this behavior is actually expected (commit 3b1efb196eee).
>>>>>>> 
>>>>>>> Right, that auto-cleanup is expected.
>>>>>>> 
>>>>>>>>> In our use case, however, we want to share the perf event among
>>>>>>>>> different processes. I think we have a few options to achieve this:
>>>>>>>>> 
>>>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
>>>>>>>>> Once this flag is set, we should not remove the fd on struct file
>>>>>>>>> release. Instead, we remove fd in map_release_uref.
>>>>>>>>> 
>>>>>>>>> 2. Allow a different process to hold reference to the perf_event
>>>>>>>>> via pinned perf event array. I guess we can achieve this by
>>>>>>>>> enabling for BPF_MAP_UPDATE_ELEM perf event array.
>>>>>>>>> 
>>>>>>>>> 3. Other ideas?
>>>>>>>>> 
>>>>>>>>> Could you please let us know your thoughts on this?
>>>>>>> 
>>>>>>> One option that would work for sure is to transfer the fd to the other
>>>>>>> processes via fd passing e.g. through pipe or unix domain socket.
>>>>>> I haven't tried to transfer the fd, but it might be tricky. We need to
>>>>>> plan for more than 2 processes sharing the events, and these processes
>>>>>> will start and terminate in any order.
>>>>>>> I guess my question would be that it would be hard to debug if we keep
>>>>>>> dangling perf event entries in there yb accident that noone is cleaning
>>>>>>> up. Some sort of flag is probably okay, but it needs proper introspection
>>>>>>> facilities from bpftool side so that it could be detected that it's just
>>>>>>> dangling around waiting for cleanup.
>>>>>> With my latest design, we don't need to pin the perf_event map (neither
>>>>>> the prog accessing the map. I guess this can make the clean up problem
>>>>>> better? So we will add a new flag for map_create. With the flag, we
>>>>> 
>>>>> I mean pinning the map itself or the prog making use of accessing the map
>>>>> is not the issue. Afaik, it's more the perf RB that is consuming memory and
>>>>> can be dangling, so the presence of the /entry/ in the map itself which
>>>>> would then not be cleaned up by accident, I think this was the motivation
>>>>> back then at least.
> 
> Daniel, are you aware of any use cases that do rely on such a behavior
> of PERV_EVENT_ARRAY?
> 
> For me this auto-removal of elements on closing *one of a few*
> PERF_EVENT_ARRAY FDs (original one, but still just one of a few active
> ones) was extremely surprising. It doesn't follow what we do for any
> other BPF map, as far as I can tell. E.g., think about
> BPF_MAP_TYPE_PROG_ARRAY. If we pin it in BPF FS and close FD, it won't
> auto-remove all the tail call programs, right? There is exactly the
> same concern with not auto-releasing bpf_progs, just like with
> perf_event. But it's not accidental, if you are pinning a BPF map, you
> know what you are doing (at least we have to assume so :).
> 
> So instead of adding an extra option, shouldn't we just fix this
> behavior instead and make it the same across all BPF maps that hold
> kernel resources?

Could you please share your thoughts on this? I personally don't have 
strong preference one way (add a flag) or the other (change the default
behavior). But I think we need to agree on the direction to go. 

Thanks,
Song

[...]