Re: Behavior of pinned perf event array

> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote:
> 
> 
> 
>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
>> 
>> 
>> 
>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>> 
>>> On 9/3/20 1:05 AM, Song Liu wrote:
>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> Hi Song,
>>>>> 
>>>>> Sorry indeed missed it.
>>>>> 
>>>>> On 9/2/20 11:33 PM, Song Liu wrote:
>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>> 
>>>>>>> We are looking at sharing perf events among multiple processes via a
>>>>>>> pinned perf event array. However, we found this doesn't really work,
>>>>>>> as the perf event is removed from the map when the struct file is
>>>>>>> released from user space (in perf_event_fd_array_release). This
>>>>>>> means the pinned perf event array can be shared among multiple
>>>>>>> processes, but each perf event still has a single owner. Once the
>>>>>>> owner process closes the fd (or terminates), the perf event is
>>>>>>> removed from the array. I went through the history of the code and
>>>>>>> found this behavior is actually expected (commit 3b1efb196eee).
>>>>> 
>>>>> Right, that auto-cleanup is expected.
>>>>> 
>>>>>>> In our use case, however, we want to share the perf event among
>>>>>>> different processes. I think we have a few options to achieve this:
>>>>>>> 
>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
>>>>>>> Once this flag is set, we should not remove the fd on struct file
>>>>>>> release. Instead, we remove the fd in map_release_uref.
>>>>>>> 
>>>>>>> 2. Allow a different process to hold a reference to the perf_event
>>>>>>> via the pinned perf event array. I guess we can achieve this by
>>>>>>> enabling BPF_MAP_UPDATE_ELEM for the perf event array.
>>>>>>> 
>>>>>>> 3. Other ideas?
>>>>>>> 
>>>>>>> Could you please let us know your thoughts on this?
>>>>> 
>>>>> One option that would work for sure is to transfer the fd to the other
>>>>> processes via fd passing, e.g. through a pipe or a unix domain socket.
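>>>>>
>>>>> A minimal sketch of the sender side (standard SCM_RIGHTS ancillary data
>>>>> over a connected unix domain socket; error handling omitted, and
>>>>> send_fd/sock are just placeholder names):
>>>>>
>>>>>   #include <string.h>
>>>>>   #include <sys/socket.h>
>>>>>
>>>>>   /* Send fd to the peer on the connected unix socket 'sock'. */
>>>>>   static int send_fd(int sock, int fd)
>>>>>   {
>>>>>           struct msghdr msg = {};
>>>>>           struct cmsghdr *cmsg;
>>>>>           char cbuf[CMSG_SPACE(sizeof(int))] = {};
>>>>>           char dummy = '*';
>>>>>           struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
>>>>>
>>>>>           msg.msg_iov = &iov;
>>>>>           msg.msg_iovlen = 1;
>>>>>           msg.msg_control = cbuf;
>>>>>           msg.msg_controllen = sizeof(cbuf);
>>>>>
>>>>>           cmsg = CMSG_FIRSTHDR(&msg);
>>>>>           cmsg->cmsg_level = SOL_SOCKET;
>>>>>           cmsg->cmsg_type = SCM_RIGHTS;
>>>>>           cmsg->cmsg_len = CMSG_LEN(sizeof(int));
>>>>>           memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
>>>>>
>>>>>           /* The receiver picks the fd up via recvmsg()/CMSG_DATA(). */
>>>>>           return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
>>>>>   }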
>>>> I haven't tried to transfer the fd, but it might be tricky. We need to
>>>> plan for more than 2 processes sharing the events, and these processes
>>>> will start and terminate in any order.
>>>>> I guess my concern would be that it would be hard to debug if we keep
>>>>> dangling perf event entries in there by accident that no one is cleaning
>>>>> up. Some sort of flag is probably okay, but it needs proper introspection
>>>>> facilities on the bpftool side so that it can be detected that an entry
>>>>> is just dangling around waiting for cleanup.
>>>> With my latest design, we don't need to pin the perf_event map (nor
>>>> the prog accessing the map). I guess this makes the cleanup problem
>>>> better? So we will add a new flag for map_create. With the flag, we
>>> 
>>> I mean pinning the map itself or the prog accessing the map is not the
>>> issue. Afaik, it's more the perf ring buffer that consumes memory and can
>>> be left dangling through the presence of the /entry/ in the map itself,
>>> which would then not be cleaned up by accident. I think this was the
>>> motivation back then, at least.
>>> 
>>>> will not close the perf_event during process termination, and we block
>>>> pinning for this map and for any program accessing it. Does this
>>>> sound like a good plan?
>>> 
>>> Could you elaborate why blocking pinning of map/prog is useful in this context?
>> 
>> I was thinking that we are more likely to forget to clean up a pinned map.
>> If the map is not pinned, it will at least be cleaned up when all processes
>> accessing it terminate. On the other hand, a pinned map would stay until
>> someone removes it from bpffs. So the idea is to avoid the pinning scenario.
>> 
>> But I agree this won't solve all the problems. Say process A added a few
>> perf events (A_events) to the map, and process B added some other events
>> (B_events) to the same map. Blocking pinning makes sure we clean up
>> everything when both A and B terminate. But if A terminates soon, while B
>> runs for a long time, A_events will stay open unnecessarily.
>> 
>> Alternatively, we can implement map_fd_sys_lookup_elem for the perf event
>> map, which returns an fd to the perf event. With this solution, if process A
>> added some events to the map and process B wants to use them after process A
>> terminates, we need to explicitly do the lookup in process B and hold the fd
>> to the perf event. Maybe this is a better solution?
> 
> Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can get an
> fd on the perf_event, but we still remove the event from the map in
> perf_event_fd_array_release(). Let me see what the best next step is...

CC Andrii and bpf@

Andrii and I had some discussion on this.

Currently, I am working on an approach with a new flag, BPF_F_SHARE_PE. I
attached the diff below.
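
For reference, here is a rough sketch of how user space could opt in with the
proposed flag (hypothetical usage, assuming uapi headers with the diff below
applied; error handling trimmed, and create_shared_pe_array is just a
placeholder name):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr)
  {
          return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
  }

  /* Create a perf_event_array with BPF_F_SHARE_PE and install the
   * perf event fd 'perf_fd' into slot 'key'. With the flag set, the
   * entry is expected to survive this process closing its fds. */
  static int create_shared_pe_array(unsigned int max_entries,
                                    int key, int perf_fd)
  {
          union bpf_attr attr;
          int map_fd;

          memset(&attr, 0, sizeof(attr));
          attr.map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
          attr.key_size = sizeof(int);
          attr.value_size = sizeof(int);
          attr.max_entries = max_entries;
          attr.map_flags = BPF_F_SHARE_PE;        /* proposed below */
          map_fd = sys_bpf(BPF_MAP_CREATE, &attr);
          if (map_fd < 0)
                  return -1;

          memset(&attr, 0, sizeof(attr));
          attr.map_fd = map_fd;
          attr.key = (__u64)(unsigned long)&key;
          attr.value = (__u64)(unsigned long)&perf_fd;
          if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr))
                  return -1;

          return map_fd;
  }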

On the other hand, we found the current behavior of perf_event_array puzzling,
especially for a pinned perf_event_array (as pinning doesn't really pin the
contents). Should we therefore consider changing the behavior without a flag?

Thanks,
Song


========================================================================

From f78720bbaec9aae3a44bd21e0902d5a215dfe0b0 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@xxxxxx>
Date: Tue, 11 Aug 2020 13:38:55 -0700
Subject: [PATCH 1/3] bpf: preserve fd in pinned perf_event_map

---
 include/uapi/linux/bpf.h       |  3 +++
 kernel/bpf/arraymap.c          | 31 +++++++++++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |  3 +++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
        BPF_F_MMAPABLE          = (1U << 10),
+
+/* Share perf_event among processes */
+       BPF_F_SHARE_PE          = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e046fb7d17cd0..3bb58fe3d444c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -15,7 +15,7 @@
 #include "map_in_map.h"

 #define ARRAY_CREATE_FLAG_MASK \
-       (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK)
+       (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | BPF_F_SHARE_PE)

 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -64,6 +64,10 @@ int array_map_alloc_check(union bpf_attr *attr)
            attr->map_flags & BPF_F_MMAPABLE)
                return -EINVAL;

+       if (attr->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+           attr->map_flags & BPF_F_SHARE_PE)
+               return -EINVAL;
+
        if (attr->value_size > KMALLOC_MAX_SIZE)
                /* if value_size is bigger, the user space won't be able to
                 * access the elements.
@@ -778,6 +782,26 @@ static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
        }
 }

+static void perf_event_fd_array_map_free(struct bpf_map *map)
+{
+       struct bpf_array *array = container_of(map, struct bpf_array, map);
+       struct bpf_event_entry *ee;
+       int i;
+
+       if (map->map_flags & BPF_F_SHARE_PE) {
+               for (i = 0; i < array->map.max_entries; i++) {
+                       ee = READ_ONCE(array->ptrs[i]);
+                       if (ee)
+                               fd_array_map_delete_elem(map, &i);
+               }
+       } else {
+               for (i = 0; i < array->map.max_entries; i++)
+                       BUG_ON(array->ptrs[i] != NULL);
+       }
+
+       bpf_map_area_free(array);
+}
+
 static void *prog_fd_array_get_ptr(struct bpf_map *map,
                                   struct file *map_file, int fd)
 {
@@ -1105,6 +1129,9 @@ static void perf_event_fd_array_release(struct bpf_map *map,
        struct bpf_event_entry *ee;
        int i;

+       if (map->map_flags & BPF_F_SHARE_PE)
+               return;
+
        rcu_read_lock();
        for (i = 0; i < array->map.max_entries; i++) {
                ee = READ_ONCE(array->ptrs[i]);
@@ -1119,7 +1146,7 @@ const struct bpf_map_ops perf_event_array_map_ops = {
        .map_meta_equal = bpf_map_meta_equal,
        .map_alloc_check = fd_array_map_alloc_check,
        .map_alloc = array_map_alloc,
-       .map_free = fd_array_map_free,
+       .map_free = perf_event_fd_array_map_free,
        .map_get_next_key = array_map_get_next_key,
        .map_lookup_elem = fd_array_map_lookup_elem,
        .map_delete_elem = fd_array_map_delete_elem,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
        BPF_F_MMAPABLE          = (1U << 10),
+
+/* Share perf_event among processes */
+       BPF_F_SHARE_PE          = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
--
2.24.1





