Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage

David Marchevsky <david.marchevsky@xxxxxxxxx> · Mon, 11 Dec 2023 12:31:46 -0500

On 11/21/23 2:49 PM, Alexei Starovoitov wrote:
> On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>>
>> On 11/20/23 10:11 PM, David Marchevsky wrote:
>>>
>>>
>>> On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
>>>> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>>>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>>>> index 173ec7f43ed1..114973f925ea 100644
>>>>> --- a/include/linux/bpf_local_storage.h
>>>>> +++ b/include/linux/bpf_local_storage.h
>>>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>>>>         * the number of cachelines accessed during the cache hit case.
>>>>>         */
>>>>>        struct bpf_local_storage_map __rcu *smap;
>>>>> -    u8 data[] __aligned(8);
>>>>> +    /* Need to duplicate smap's map_flags as smap may be gone when
>>>>> +     * it's time to free bpf_local_storage_data
>>>>> +     */
>>>>> +    u64 smap_map_flags;
>>>>> +    /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>>>>> +     * Otherwise the actual mapval data lives here
>>>>> +     */
>>>>> +    union {
>>>>> +        DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>>>>> +        void *actual_data __aligned(8);
>>>>
>>>> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>>>>
>>>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>>>>
>>>> struct normal_and_mmap_value {
>>>>      int some_int;
>>>>      int __percpu_kptr *some_cnts;
>>>>
>>>>      struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> struct mmap_only_value {
>>>>      struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> [ ... ]
>>>>
>>>
>>> This is an intriguing idea. For conciseness I'll call this specific
>>> kind of kptr 'mmapable kptrs' for the rest of this message. Below is
>>> more of a brainstorming dump than a cohesive response, separate trains
>>> of thought are separated by two newlines.
>>
>> Thanks for bearing with me while some ideas could be crazy. I am trying to see
>> how this would look like for other local storage, sk and inode. Allocating a
>> page for each sk will not be nice for server with half a million sk(s). e.g.
>> half a million sk(s) sharing a few bandwidth policies or a few tuning
>> parameters. Creating something mmap'able to the user space and also sharable
>> among many sk(s) will be useful.
>>
>>>
>>>
>>> My initial thought upon seeing struct normal_and_mmap_value was to note
>>> that we currently don't support mmaping for map_value types with _any_
>>> special fields ('special' as determined by btf_parse_fields). But IIUC
>>> you're actually talking about exposing the some_stats pointee memory via
>>> mmap, not the containing struct with kptr fields. That is, for maps that
>>> support these kptrs, mmap()ing a map with value type struct
>>> normal_and_mmap_value would return the some_stats pointer value, and
>>> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
>>> logic in this patch. We'd only be able to support one such mmapable kptr
>>> field per mapval type, but that isn't a dealbreaker.
>>>
>>> Some maps, like task_storage, would only support mmap() on a map_value
>>> with mmapable kptr field, as mmap()ing the mapval itself doesn't make
>>> sense or is unsafe. Seems like arraymap would do the opposite, only
>>
>> Changing direction a bit since arraymap is brought up. :)
>>
>> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
>> arraymap as kptr, the bpf prog should be able to access it as a map. More like
>> the current map-in-map setup. The arraymap can be used as regular map in the
>> user space also (like pinning). It may need some btf plumbing to tell the value
>> type of the arrayamp to the verifier.
>>
>> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
>> can be used where the value->array_mmap initialized as an arraymap_fd. This will
>> limit the arraymap kptr update only from the syscall side which seems to be your
>> usecase also? Allocating the arraymap from the bpf prog side needs some thoughts
>> and need a whitelist.
>>
>> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
>> &task_pidfd, &value). The kernel can return a fd in value->array_mmap. May be we
>> can create a libbpf helper to free the fd(s) resources held in the looked-up
>> value by using the value's btf.
>>
>> The bpf_local_storage_map side probably does not need to support mmap() then.
> 
> Martin,
> that's an interesting idea!
> I kinda like it and I think it's worth exploring further.
> 
> I think the main quirk of the proposed mmap-of-task-local-storage
> is using 'current' task as an implicit 'key' in task local storage map.
> It fits here, but I'm not sure it addresses sched-ext use case.
> 
> Tejun, David,
> could you please chime in ?
> Do you think mmap(..., task_local_storage_map_fd, ...)
> that returns a page that belongs to current task only is enough ?
> 
> If not we need to think through how to mmap local storage of other
> tasks. One proposal was to use pgoff to carry the key somehow
> like io-uring does, but if we want to generalize that the pgoff approach
> falls apart if we want __mmapable_kptr to work like Martin is proposing above,
> since the key will not fit in 64-bit of pgoff.
> 
> Maybe we need an office hours slot to discuss. This looks to be a big
> topic. Not sure we can converge over email.
> Just getting everyone on the same page will take a lot of email reading.

Meta BPF folks were all in one place for reasons unrelated to this
thread, and decided to have a design discussion regarding this mmapable
task_local storage implementation and other implementation ideas
discussed in this thread. I will summarize the discussion below. We
didn't arrive at a confident conclusion, though, so plenty of time for
others to chime in and for proper office hours discussion to happen if
necessary. Below, anything attributed to a specific person is
paraphrased from my notes, there will certainly be errors /
misrememberings.

mmapable task_local storage design discussion 11/29

We first summarized approaches that were discussed in this thread:

1) Current implementation in this series

2) pseudo-map_in_map, kptr arraymap type:

  struct mmapable_data_map {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  };

  struct my_mapval {
    int whatever;
    struct bpf_arraymap __kptr __arraymap_type(mmapable_data_map) *m;
    /* Need special logic to support getting / updating above field from userspace (as fd) */
  };

3) "mmapable page" kptr type, or "mmapable kptr" tag

  struct my_mapval {
    int whatever;
    struct bpf_mmapable_page *m;
  };

  struct my_mapval2 { /* Separate approach than my_mapval above */
    struct bpf_spin_lock l;
    int some_int;
    struct my_type __mmapable_kptr *some_stats;
  };

Points of consideration regardless of implementation approach:
  * mmap() syscall's return address must be page-aligned. If we want to
    reduce / eliminate wasted memory by supporting packing of multiple
    mapvals onto single page, need to be able to return non-page-aligned
    addr. Using a BPF syscall subcommand in lieu of or in addition to
    mmap() syscall would be a way to support this.
    * Dave wants to avoid implementing packing at all, but having a
      reasonable path forward would be nice in case actual usecase
      arises.
    * Andrii suggested a new actual mmap syscall supporting passing of
      custom params, useful for any subsystem using mmap in
      nontraditional ways. This was initially a response to "use offset
      to pass task id" discussion re: selecting non-current task.

   * How orthogonal is Martin's "kptr to arraymap" suggestion from the
     general mmapable local_storage goal? Is there a world where we
     choose a different approach for this work, and then to "kptr to
     arraymap" independently later?

The above was mostly summary of existing discussion in this thread. Rest
of discussion flowed from there.

Q&A:

- Do we need to be able to query other tasks' mapvals? (for David Vernet
  / Tejun Heo)

TJ had a usecase where this might've been necessary, but rewrote.
David: Being able to do this in general, aside from TJ's specific case,
would be useful. David provided an example from ghOSt project - central
scheduling. Another example: Folly runtime framework, farms out work to
worker threads, might want to tag them.

- Which usecases actually care about avoiding syscall overhead?

Alexei: Most services that would want to use this functionality to tag
their tasks don't need it, as they just set the value once.
Dave: Some tracing usecases (e.g. strobemeta) need it.

- Do we want to use mmap() in current-task-only limited way, or do BPF
  subcommand or something more exotic?

TJ: What if bpf subcommand returns FD that can be mmap'd. fd identifies
which task_local storage elem is mmap'd. Subcommand:
bpf_map_lookup_elem_fd(map *, u64 elem_id).
Alexei: Such a thing should return fd to arbitrary mapval, and should
support other common ops (open, close, etc.).
David: What's the problem w/ having fd that only supports mmap for now?
TJ: 'Dangling' fds exist in some usecases already

Discussion around the bpf_map_lookup_elem_fd idea continued for quite a
while. Folks liked that it avoids the "can only have one mmapable field"
issue from proposal (3) above, without making any implicit assumptions.

Alexei: Can we instead have pointer to userspace blob - similar to rseq
- to avoid wasting most of page?

TJ: My instinct is to stick to something more generic, would rather pay
4k.

Discussion around userspace pointer continued for a while as well,
ending in the conclusion that we should take a look at using
get_user_pages, perhaps wrapping such functionality in a 'guppable' kptr
type. Folks liked the 'guppable' idea as it sort-of avoids the wasted
memory issue, pushing the details to userspace, and punts on working out
a path forward for mmap interface, which other implementation ideas
require.

Action items based on convo, priority order:

  - Think more about / prototype 'guppable' kptr idea
  - If the above has issues, try bpf_map_lookup_elem_fd
  - If both above have issues, consider earlier approaches

I will start tackling these soon.