On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>> index 173ec7f43ed1..114973f925ea 100644
>> --- a/include/linux/bpf_local_storage.h
>> +++ b/include/linux/bpf_local_storage.h
>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>  	 * the number of cachelines accessed during the cache hit case.
>>  	 */
>>  	struct bpf_local_storage_map __rcu *smap;
>> -	u8 data[] __aligned(8);
>> +	/* Need to duplicate smap's map_flags as smap may be gone when
>> +	 * it's time to free bpf_local_storage_data
>> +	 */
>> +	u64 smap_map_flags;
>> +	/* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>> +	 * Otherwise the actual mapval data lives here
>> +	 */
>> +	union {
>> +		DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>> +		void *actual_data __aligned(8);
>
> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>
> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr? bpf_local_storage_map is one of the maps that supports kptr.
>
> struct normal_and_mmap_value {
> 	int some_int;
> 	int __percpu_kptr *some_cnts;
>
> 	struct bpf_mmap_page __kptr *some_stats;
> };
>
> struct mmap_only_value {
> 	struct bpf_mmap_page __kptr *some_stats;
> };
>
> [ ... ]
>

This is an intriguing idea. For conciseness I'll call this specific kind of kptr 'mmapable kptrs' for the rest of this message. Below is more of a brainstorming dump than a cohesive response; separate trains of thought are separated by blank lines.

My initial thought upon seeing struct normal_and_mmap_value was to note that we currently don't support mmap()ing map_value types with _any_ special fields ('special' as determined by btf_parse_fields). But IIUC you're actually talking about exposing the some_stats pointee memory via mmap, not the containing struct with kptr fields. That is, for maps that support these kptrs, mmap()ing a map with value type struct normal_and_mmap_value would return the some_stats pointer value, and likely initialize the pointer similarly to the BPF_LOCAL_STORAGE_GET_F_CREATE logic in this patch. We'd only be able to support one such mmapable kptr field per mapval type, but that isn't a dealbreaker.

Some maps, like task_storage, would only support mmap() on a map_value with an mmapable kptr field, as mmap()ing the mapval itself doesn't make sense or is unsafe. It seems like arraymap would do the opposite, only supporting mmap()ing the mapval itself. I'm curious whether any map could feasibly support both, and if so, it might have to do logic like:

  if (map_val has mmapable kptr)
  	mmap the pointee of the mmapable kptr
  else
  	mmap the map_val itself

which is maybe too confusing a detail to expose to BPF program writers. Maybe this is a little too presumptuous and brainstorm-ey given the limited number of mmap()able maps currently, but making this a kptr type means maps should either ignore/fail if they don't support it, or have consistent semantics amongst the maps that do support it.

Instead of

  struct bpf_mmap_page __kptr *some_stats;

I'd prefer something like

  struct my_type {
  	long count;
  	long another_count;
  };

  struct mmap_only_value {
  	struct my_type __mmapable_kptr *some_stats;
  };

This way the type of the mmap()able field is known to BPF programs that interact with it. This is all assuming that struct bpf_mmap_page is an opaque page-sized blob of bytes.
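
To make this a bit more concrete, here's roughly what I'm picturing on the BPF prog side for task storage. All of the below is hypothetical - the __mmapable_kptr tag and its btf_type_tag spelling don't exist, and I'm hand-waving how the pointee gets created (bpf_obj_new-style from the prog, or lazily from the mmap path) and how the verifier would treat direct loads/stores through it:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* Hypothetical tag, following the existing __kptr convention */
  #define __mmapable_kptr __attribute__((btf_type_tag("mmapable_kptr")))

  struct my_type {
  	long count;
  	long another_count;
  };

  struct mmap_only_value {
  	struct my_type __mmapable_kptr *some_stats;
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
  	__uint(map_flags, BPF_F_NO_PREALLOC);
  	__type(key, int);
  	__type(value, struct mmap_only_value);
  } task_stats SEC(".maps");

  SEC("tp_btf/sched_switch")
  int BPF_PROG(count_switches, bool preempt, struct task_struct *prev,
  	     struct task_struct *next)
  {
  	struct mmap_only_value *v;

  	v = bpf_task_storage_get(&task_stats, next, NULL,
  				 BPF_LOCAL_STORAGE_GET_F_CREATE);
  	if (!v || !v->some_stats)
  		return 0;
  	/* This update lands in the page userspace would see via mmap() */
  	__sync_fetch_and_add(&v->some_stats->count, 1);
  	return 0;
  }

  char _license[] SEC("license") = "GPL";

Userspace would then mmap() the element - however we end up addressing a particular task's element - and read count directly, knowing it's a struct my_type rather than an opaque blob.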
We could then support structs like

  struct mmap_value_and_lock {
  	struct bpf_spin_lock l;
  	int some_int;

  	struct my_type __mmapable_kptr *some_stats;
  };

and have the bpf_map_update_elem handler use the spin_lock instead of the map-internal lock where appropriate. But there's no way to ensure a userspace task using the mmap()ed region uses the spin_lock.

>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>> index 146824cc9689..9b3becbcc1a3 100644
>> --- a/kernel/bpf/bpf_local_storage.c
>> +++ b/kernel/bpf/bpf_local_storage.c
>> @@ -15,7 +15,8 @@
>>  #include <linux/rcupdate_trace.h>
>>  #include <linux/rcupdate_wait.h>
>> -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
>> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
>> +	(BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>>  static struct bpf_local_storage_map_bucket *
>>  select_bucket(struct bpf_local_storage_map *smap,
>> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
>>  	return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
>>  }
>> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
>> +
>> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
>> +{
>> +	struct mem_cgroup *memcg, *old_memcg;
>> +	void *ptr;
>> +
>> +	memcg = bpf_map_get_memcg(&smap->map);
>> +	old_memcg = set_active_memcg(memcg);
>> +	ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
>> +					  NUMA_NO_NODE);
>> +	set_active_memcg(old_memcg);
>> +	mem_cgroup_put(memcg);
>> +
>> +	return ptr;
>> +}
>
> [ ... ]
>
>> @@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>>  		void *value, bool charge_mem, gfp_t gfp_flags)
>>  {
>>  	struct bpf_local_storage_elem *selem;
>> +	void *mmapable_value = NULL;
>> +	u32 selem_mem;
>> -	if (charge_mem && mem_charge(smap, owner, smap->elem_size))
>> +	selem_mem = selem_bytes_used(smap);
>> +	if (charge_mem && mem_charge(smap, owner, selem_mem))
>>  		return NULL;
>> +	if (smap->map.map_flags & BPF_F_MMAPABLE) {
>> +		mmapable_value = alloc_mmapable_selem_value(smap);
>
> This probably is not always safe for a bpf prog to do. Leaving aside that the gfp_flags was not used, the bpf local storage is moving to the bpf's memalloc because of
> https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@xxxxxxxxxx/
>

Minor point: alloc_mmapable_selem_value's bpf_map_area_mmapable_alloc call will always call vmalloc under the hood. vmalloc has locks as well, so your point stands.

I think I see how this ties into your 'specific kptr type' proposal above. Let me know if this sounds right: if there were a bpf_mem_alloc type focused on providing vmalloc'd mmap()able memory, we could use it here instead of raw vmalloc and avoid the lock recursion problem linked above. Such an allocator could be used in something like bpf_obj_new to create the __mmapable_kptr - either from the BPF prog, or from the mmap path before remap_vmalloc_range.

re: gfp_flags, it looks like the verifier sets this param to either GFP_ATOMIC or GFP_KERNEL. Looks like we should not allow GFP_KERNEL allocs here?

>> +		if (!mmapable_value)
>> +			goto err_out;
>> +	}
>> +
>
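
Expanding on the "mmap path before remap_vmalloc_range" idea: below is a very rough sketch of how a local storage map_mmap callback might lazily create the mmapable value and hand it to userspace. None of this exists today - local storage maps have no ->map_mmap, and lookup_or_create_sdata_for_current() is a made-up placeholder for the element lookup/creation and locking that would actually be needed:

  static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
  {
  	struct bpf_local_storage_map *smap =
  		container_of(map, struct bpf_local_storage_map, map);
  	struct bpf_local_storage_data *sdata;

  	/* Placeholder: find (or create) current task's element for this map */
  	sdata = lookup_or_create_sdata_for_current(smap);
  	if (IS_ERR(sdata))
  		return PTR_ERR(sdata);

  	if (!sdata->actual_data) {
  		/* mmap() is a sleepable context, so the vmalloc-backed
  		 * alloc is fine here, unlike in the bpf prog path
  		 */
  		sdata->actual_data = alloc_mmapable_selem_value(smap);
  		if (!sdata->actual_data)
  			return -ENOMEM;
  	}

  	/* bpf_map_area_mmapable_alloc() returns VM_USERMAP vmalloc memory,
  	 * so it can be mapped into userspace with remap_vmalloc_range()
  	 */
  	return remap_vmalloc_range(vma, sdata->actual_data, vma->vm_pgoff);
  }

The same shape would work whether actual_data comes from raw vmalloc as in this patch or from an mmapable bpf_mem_alloc flavor, and it keeps the sleeping alloc out of the bpf_selem_alloc path that can be reached with GFP_ATOMIC.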