On 11/20/23 10:11 PM, David Marchevsky wrote:
On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
On 11/20/23 9:59 AM, Dave Marchevsky wrote:
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 173ec7f43ed1..114973f925ea 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -69,7 +69,17 @@ struct bpf_local_storage_data {
* the number of cachelines accessed during the cache hit case.
*/
struct bpf_local_storage_map __rcu *smap;
- u8 data[] __aligned(8);
+ /* Need to duplicate smap's map_flags as smap may be gone when
+ * it's time to free bpf_local_storage_data
+ */
+ u64 smap_map_flags;
+ /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
+ * Otherwise the actual mapval data lives here
+ */
+ union {
+ DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
+ void *actual_data __aligned(8);
The pages (that can be mmap'ed later) feel like a specific kind of kptr.
Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value? bpf_local_storage_map is one of the maps that supports kptrs.
struct normal_and_mmap_value {
        int some_int;
        int __percpu_kptr *some_cnts;
        struct bpf_mmap_page __kptr *some_stats;
};

struct mmap_only_value {
        struct bpf_mmap_page __kptr *some_stats;
};
[ ... ]
This is an intriguing idea. For conciseness I'll call this specific kind of kptr 'mmapable kptrs' for the rest of this message. Below is more of a brainstorming dump than a cohesive response; separate trains of thought are separated by blank lines.
Thanks for bearing with me; some of these ideas could be crazy. I am trying to see how this would look for the other local storages, sk and inode. Allocating a page for each sk will not be nice for a server with half a million sk(s), e.g. half a million sk(s) sharing a few bandwidth policies or a few tuning parameters. Creating something that is mmap'able to user space and also sharable among many sk(s) would be useful.
My initial thought upon seeing struct normal_and_mmap_value was to note that we currently don't support mmap()ing map_value types with _any_ special fields ('special' as determined by btf_parse_fields). But IIUC you're actually talking about exposing the some_stats pointee memory via mmap, not the containing struct with kptr fields. That is, for maps that support these kptrs, mmap()ing a map with value type struct normal_and_mmap_value would map the memory some_stats points to, and likely initialize the pointer similarly to the BPF_LOCAL_STORAGE_GET_F_CREATE logic in this patch. We'd only be able to support one such mmapable kptr field per mapval type, but that isn't a dealbreaker.
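A rough userspace sketch of that reading, to make it concrete. None of this is existing behavior, struct bpf_mmap_page is the opaque type from your example, and how a particular element (e.g. a particular task) would be selected is glossed over:

#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical: mmap() on the map fd would expose the memory behind the
 * single mmapable kptr field (some_stats), not the containing
 * struct normal_and_mmap_value. Element selection is ignored here.
 */
struct bpf_mmap_page;  /* opaque page-sized type from the example above */

static struct bpf_mmap_page *map_stats(int map_fd)
{
        void *p = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE,
                       MAP_SHARED, map_fd, 0);

        return p == MAP_FAILED ? NULL : p;
}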
Some maps, like task_storage, would only support mmap() on a map_value with a mmapable kptr field, as mmap()ing the mapval itself doesn't make sense or is unsafe. Seems like arraymap would do the opposite, only
Changing direction a bit since arraymap is brought up. :)
arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an arraymap as a kptr, the bpf prog should be able to access it as a map, much like the current map-in-map setup. The arraymap can also be used as a regular map in user space (e.g. pinning). It may need some btf plumbing to tell the verifier the value type of the arraymap.
The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags) can be used, with value->array_mmap initialized to an arraymap fd. This would limit arraymap kptr updates to the syscall side, which seems to be your usecase also? Allocating the arraymap from the bpf prog side needs some thought and would need a whitelist.
The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd, &task_pidfd, &value). The kernel can return an fd in value->array_mmap. Maybe we can create a libbpf helper that frees the fd resources held in the looked-up value by using the value's btf.
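Roughly, the syscall-side flow above could look like the sketch below. To be clear, the fd-in-value convention (array_mmap carrying an arraymap fd on update, a new fd being returned on lookup) is only the idea being floated here, not existing behavior, and all names are made up:

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Hypothetical value layout as seen from the syscall side: the proposed
 * arraymap kptr slot is written and read as an fd.
 */
struct mmap_only_value_user {
        __u64 array_mmap;       /* arraymap fd on update; new fd on lookup */
};

static int store_arraymap(int task_storage_map_fd, int task_pidfd,
                          int arraymap_fd)
{
        struct mmap_only_value_user val = { .array_mmap = arraymap_fd };

        /* the kernel would resolve the fd to an arraymap kptr */
        return bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &val,
                                   BPF_ANY);
}

static int fetch_arraymap(int task_storage_map_fd, int task_pidfd)
{
        struct mmap_only_value_user val;

        /* the kernel would install and return a new fd for the stored arraymap */
        if (bpf_map_lookup_elem(task_storage_map_fd, &task_pidfd, &val))
                return -1;
        return (int)val.array_mmap;     /* caller must eventually close() it */
}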
The bpf_local_storage_map side probably does not need to support mmap() then.
supporting mmap()ing the mapval itself. I'm curious whether any map could feasibly support both, and if so, whether it would have to do logic like:

if (map_val has mmapable kptr)
        mmap the pointee of the mmapable kptr
else
        mmap the map_val itself
Which is maybe too confusing of a detail to expose to BPF program
writers. Maybe a little too presumptuous and brainstorm-ey given the
limited number of mmap()able maps currently, but making this a kptr type
means maps should either ignore/fail if they don't support it, or have
consistent semantics amongst maps that do support it.
Instead of struct bpf_mmap_page __kptr *some_stats; I'd prefer
something like
struct my_type { long count; long another_count; };

struct mmap_only_value {
        struct my_type __mmapable_kptr *some_stats;
};
This way the type of the mmap()able field is known to BPF programs that interact with it. This is all assuming that struct bpf_mmap_page is an opaque page-sized blob of bytes.
We could then support structs like
struct mmap_value_and_lock {
        struct bpf_spin_lock l;
        int some_int;
        struct my_type __mmapable_kptr *some_stats;
};
and have the bpf_map_update_elem handler use the spin_lock instead of a map-internal lock where appropriate. But there's no way to ensure that a userspace task using the mmap()ed region takes the spin_lock.
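For the syscall side of that, I believe the existing BPF_F_LOCK flag already asks the kernel to copy the value under its bpf_spin_lock; a minimal userspace sketch, mirroring struct mmap_value_and_lock above minus the hypothetical __mmapable_kptr field:

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Same layout as mmap_value_and_lock above, with the proposed
 * __mmapable_kptr field left out since that tag does not exist today.
 */
struct mmap_value_and_lock_user {
        struct bpf_spin_lock l;
        int some_int;
};

/* Update one task's storage, letting the kernel take the value's
 * bpf_spin_lock during the copy (existing BPF_F_LOCK semantics).
 */
static int update_locked(int task_storage_map_fd, int task_pidfd, int v)
{
        struct mmap_value_and_lock_user val = { .some_int = v };

        return bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &val,
                                   BPF_F_LOCK);
}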
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 146824cc9689..9b3becbcc1a3 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -15,7 +15,8 @@
#include <linux/rcupdate_trace.h>
#include <linux/rcupdate_wait.h>
-#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
+#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
+ (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
static struct bpf_local_storage_map_bucket *
select_bucket(struct bpf_local_storage_map *smap,
@@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
}
+struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
+
+void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
+{
+ struct mem_cgroup *memcg, *old_memcg;
+ void *ptr;
+
+ memcg = bpf_map_get_memcg(&smap->map);
+ old_memcg = set_active_memcg(memcg);
+ ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
+ NUMA_NO_NODE);
+ set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
+
+ return ptr;
+}
[ ... ]
@@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
void *value, bool charge_mem, gfp_t gfp_flags)
{
struct bpf_local_storage_elem *selem;
+ void *mmapable_value = NULL;
+ u32 selem_mem;
- if (charge_mem && mem_charge(smap, owner, smap->elem_size))
+ selem_mem = selem_bytes_used(smap);
+ if (charge_mem && mem_charge(smap, owner, selem_mem))
return NULL;
+ if (smap->map.map_flags & BPF_F_MMAPABLE) {
+ mmapable_value = alloc_mmapable_selem_value(smap);
This probably is not always safe for a bpf prog to do. Leaving aside that the gfp_flags is not used, the bpf local storage is moving to bpf's memalloc because of https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@xxxxxxxxxx/
Minor point: alloc_mmapable_selem_value's bpf_map_area_mmapable_alloc
call will always call vmalloc under the hood. vmalloc has locks as well,
so your point stands.
I think I see how this ties into your 'specific kptr type' proposal above. Let me know if this sounds right: if there were a bpf_mem_alloc flavor focused on providing vmalloc'd mmap()able memory, we could use it here instead of raw vmalloc and avoid the lock recursion problem linked above. Such an allocator could be used in something like bpf_obj_new to create the __mmapable_kptr - either from the BPF prog or from the mmap path before remap_vmalloc_range.
re: gfp_flags, it looks like the verifier sets this param to either GFP_ATOMIC or GFP_KERNEL. Looks like we should not allow GFP_KERNEL allocs here?
Going back to this patch, I am not sure what it takes to make bpf_mem_alloc() mmap()able. Maybe we can limit the blast radius for now, e.g. limit this alloc to the user space mmap() call. Does that fit your use case? A whitelist for the bpf prog side could also be created later if needed.
+ if (!mmapable_value)
+ goto err_out;
+ }
+