> On Sep 8, 2020, at 5:32 PM, Song Liu <songliubraving@xxxxxx> wrote:
>
>> On Sep 8, 2020, at 10:22 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>
>>> On Sep 3, 2020, at 2:22 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>
>>> On 9/3/20 1:05 AM, Song Liu wrote:
>>>>> On Sep 2, 2020, at 3:28 PM, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi Song,
>>>>>
>>>>> Sorry, indeed missed it.
>>>>>
>>>>> On 9/2/20 11:33 PM, Song Liu wrote:
>>>>>>> On Aug 24, 2020, at 4:57 PM, Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>>
>>>>>>> We are looking at sharing perf events among multiple processes via a
>>>>>>> pinned perf event array. However, we found this doesn't really work,
>>>>>>> as the perf event is removed from the map when the struct file is
>>>>>>> released from user space (in perf_event_fd_array_release). This
>>>>>>> means the pinned perf event array can be shared among multiple
>>>>>>> processes, but each perf event still has a single owner. Once the
>>>>>>> owner process closes the fd (or terminates), the perf event is
>>>>>>> removed from the array. I went through the history of the code and
>>>>>>> found this behavior is actually expected (commit 3b1efb196eee).
>>>>>
>>>>> Right, that auto-cleanup is expected.
>>>>>
>>>>>>> In our use case, however, we want to share the perf event among
>>>>>>> different processes. I think we have a few options to achieve this:
>>>>>>>
>>>>>>> 1. Introduce a new flag for the perf event map, like BPF_F_KEEP_PE_OPEN.
>>>>>>>    Once this flag is set, we should not remove the fd on struct file
>>>>>>>    release. Instead, we remove the fd in map_release_uref.
>>>>>>>
>>>>>>> 2. Allow a different process to hold a reference to the perf_event
>>>>>>>    via the pinned perf event array. I guess we can achieve this by
>>>>>>>    enabling BPF_MAP_UPDATE_ELEM for the perf event array.
>>>>>>>
>>>>>>> 3. Other ideas?
>>>>>>>
>>>>>>> Could you please let us know your thoughts on this?
>>>>>
>>>>> One option that would work for sure is to transfer the fd to the other
>>>>> processes via fd passing, e.g. through a pipe or unix domain socket.
>>>>
>>>> I haven't tried to transfer the fd, but it might be tricky. We need to
>>>> plan for more than 2 processes sharing the events, and these processes
>>>> will start and terminate in any order.
>>>>
>>>>> I guess my question would be that it would be hard to debug if we keep
>>>>> dangling perf event entries in there by accident that no one is cleaning
>>>>> up. Some sort of flag is probably okay, but it needs proper introspection
>>>>> facilities from the bpftool side so that it can be detected that it's
>>>>> just dangling around waiting for cleanup.
>>>>
>>>> With my latest design, we don't need to pin the perf_event map (nor
>>>> the prog accessing the map). I guess this can make the cleanup problem
>>>> better? So we will add a new flag for map_create. With the flag, we
>>>
>>> I mean pinning the map itself or the prog making use of / accessing the map
>>> is not the issue. Afaik, it's more the perf RB that is consuming memory and
>>> can be dangling, so the presence of the /entry/ in the map itself which
>>> would then not be cleaned up by accident; I think this was the motivation
>>> back then at least.
>>>
>>>> will not close the perf_event during process termination, and we block
>>>> pinning for this map and any program accessing this map. Does this
>>>> sound like a good plan?
>>>
>>> Could you elaborate why blocking pinning of map/prog is useful in this context?
>>
>> I was thinking we are more likely to forget cleaning up a pinned map. If the
>> map is not pinned, it will at least be cleaned up when all processes accessing
>> it terminate. On the other hand, a pinned map would stay until someone removes
>> it from bpffs. So the idea is to avoid the pinning scenario.
>>
>> But I agree this won't solve all the problems. Say process A added a few
>> perf events (A_events) to the map.
>> And process B added some other events
>> (B_events) to the same map. Blocking pinning makes sure we clean up everything
>> when both A and B terminate. But if A terminates soon, while B runs for a
>> long time, A_events will stay open unnecessarily.
>>
>> Alternatively, we can implement map_fd_sys_lookup_elem for the perf event map,
>> which returns an fd to the perf event. With this solution, if process A added
>> some events to the map, and process B wants to use them after process A
>> terminates, we need to explicitly do the lookup in process B and hold the fd
>> to the perf event. Maybe this is a better solution?
>
> Actually, this doesn't work. :( With map_fd_sys_lookup_elem(), we can get an fd
> on the perf_event, but we still remove the event from the map in
> perf_event_fd_array_release(). Let me see what the best next step is...

CC Andrii and bpf@

Andrii and I had some discussion on this. Currently, I am working on something
with a new flag BPF_F_SHARE_PE. I attached the diff below.

On the other hand, we found the current behavior of perf_event_array puzzling,
especially a pinned perf_event_array (as pinning doesn't really pin the
content). Therefore, we may consider changing the behavior without a flag?
Thanks,
Song

========================================================================

From f78720bbaec9aae3a44bd21e0902d5a215dfe0b0 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@xxxxxx>
Date: Tue, 11 Aug 2020 13:38:55 -0700
Subject: [PATCH 1/3] bpf: preserve fd in pinned perf_event_map

---
 include/uapi/linux/bpf.h       |  3 +++
 kernel/bpf/arraymap.c          | 31 +++++++++++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |  3 +++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
 	BPF_F_MMAPABLE = (1U << 10),
+
+/* Share perf_event among processes */
+	BPF_F_SHARE_PE = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e046fb7d17cd0..3bb58fe3d444c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -15,7 +15,7 @@
 #include "map_in_map.h"

 #define ARRAY_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK)
+	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | BPF_F_SHARE_PE)

 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -64,6 +64,10 @@ int array_map_alloc_check(union bpf_attr *attr)
 	    attr->map_flags & BPF_F_MMAPABLE)
 		return -EINVAL;

+	if (attr->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+	    attr->map_flags & BPF_F_SHARE_PE)
+		return -EINVAL;
+
 	if (attr->value_size > KMALLOC_MAX_SIZE)
 		/* if value_size is bigger, the user space won't be able to
 		 * access the elements.
@@ -778,6 +782,26 @@ static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
 	}
 }

+static void perf_event_fd_array_map_free(struct bpf_map *map)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct bpf_event_entry *ee;
+	int i;
+
+	if (map->map_flags & BPF_F_SHARE_PE) {
+		for (i = 0; i < array->map.max_entries; i++) {
+			ee = READ_ONCE(array->ptrs[i]);
+			if (ee)
+				fd_array_map_delete_elem(map, &i);
+		}
+	} else {
+		for (i = 0; i < array->map.max_entries; i++)
+			BUG_ON(array->ptrs[i] != NULL);
+	}
+
+	bpf_map_area_free(array);
+}
+
 static void *prog_fd_array_get_ptr(struct bpf_map *map,
 				   struct file *map_file, int fd)
 {
@@ -1105,6 +1129,9 @@ static void perf_event_fd_array_release(struct bpf_map *map,
 	struct bpf_event_entry *ee;
 	int i;

+	if (map->map_flags & BPF_F_SHARE_PE)
+		return;
+
 	rcu_read_lock();
 	for (i = 0; i < array->map.max_entries; i++) {
 		ee = READ_ONCE(array->ptrs[i]);
@@ -1119,7 +1146,7 @@ const struct bpf_map_ops perf_event_array_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
 	.map_alloc_check = fd_array_map_alloc_check,
 	.map_alloc = array_map_alloc,
-	.map_free = fd_array_map_free,
+	.map_free = perf_event_fd_array_map_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = fd_array_map_lookup_elem,
 	.map_delete_elem = fd_array_map_delete_elem,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 90359cab501d1..b527a28a4ba5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -413,6 +413,9 @@ enum {

 /* Enable memory-mapping BPF map */
 	BPF_F_MMAPABLE = (1U << 10),
+
+/* Share perf_event among processes */
+	BPF_F_SHARE_PE = (1U << 11),
 };

 /* Flags for BPF_PROG_QUERY. */
--
2.24.1