Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 11/20/23 10:11 PM, David Marchevsky wrote:
> >
> >
> > On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
> >> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
> >>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> >>> index 173ec7f43ed1..114973f925ea 100644
> >>> --- a/include/linux/bpf_local_storage.h
> >>> +++ b/include/linux/bpf_local_storage.h
> >>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> >>>         * the number of cachelines accessed during the cache hit case.
> >>>         */
> >>>        struct bpf_local_storage_map __rcu *smap;
> >>> -    u8 data[] __aligned(8);
> >>> +    /* Need to duplicate smap's map_flags as smap may be gone when
> >>> +     * it's time to free bpf_local_storage_data
> >>> +     */
> >>> +    u64 smap_map_flags;
> >>> +    /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> >>> +     * Otherwise the actual mapval data lives here
> >>> +     */
> >>> +    union {
> >>> +        DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> >>> +        void *actual_data __aligned(8);
> >>
> >> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
> >>
> >> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
> >>
> >> struct normal_and_mmap_value {
> >>      int some_int;
> >>      int __percpu_kptr *some_cnts;
> >>
> >>      struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> struct mmap_only_value {
> >>      struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> [ ... ]
> >>
> >
> > This is an intriguing idea. For conciseness I'll call this specific
> > kind of kptr 'mmapable kptrs' for the rest of this message. Below is
> > more of a brainstorming dump than a cohesive response, separate trains
> > of thought are separated by two newlines.
>
> Thanks for bearing with me while some ideas could be crazy. I am trying to see
> how this would look like for other local storage, sk and inode. Allocating a
> page for each sk will not be nice for server with half a million sk(s). e.g.
> half a million sk(s) sharing a few bandwidth policies or a few tuning
> parameters. Creating something mmap'able to the user space and also sharable
> among many sk(s) will be useful.
>
> >
> >
> > My initial thought upon seeing struct normal_and_mmap_value was to note
> > that we currently don't support mmaping for map_value types with _any_
> > special fields ('special' as determined by btf_parse_fields). But IIUC
> > you're actually talking about exposing the some_stats pointee memory via
> > mmap, not the containing struct with kptr fields. That is, for maps that
> > support these kptrs, mmap()ing a map with value type struct
> > normal_and_mmap_value would return the some_stats pointer value, and
> > likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
> > logic in this patch. We'd only be able to support one such mmapable kptr
> > field per mapval type, but that isn't a dealbreaker.
> >
> > Some maps, like task_storage, would only support mmap() on a map_value
> > with mmapable kptr field, as mmap()ing the mapval itself doesn't make
> > sense or is unsafe. Seems like arraymap would do the opposite, only
>
> Changing direction a bit since arraymap is brought up. :)
>
> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
> arraymap as kptr, the bpf prog should be able to access it as a map. More like
> the current map-in-map setup. The arraymap can be used as regular map in the
> user space also (like pinning). It may need some btf plumbing to tell the value
> type of the arrayamp to the verifier.
>
> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
> can be used where the value->array_mmap initialized as an arraymap_fd. This will
> limit the arraymap kptr update only from the syscall side which seems to be your
> usecase also? Allocating the arraymap from the bpf prog side needs some thoughts
> and need a whitelist.
>
> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
> &task_pidfd, &value). The kernel can return a fd in value->array_mmap. May be we
> can create a libbpf helper to free the fd(s) resources held in the looked-up
> value by using the value's btf.
>
> The bpf_local_storage_map side probably does not need to support mmap() then.

Martin,
that's an interesting idea!
I kinda like it and I think it's worth exploring further.

I think the main quirk of the proposed mmap-of-task-local-storage
is using 'current' task as an implicit 'key' in task local storage map.
It fits here, but I'm not sure it addresses sched-ext use case.

Tejun, David,
could you please chime in ?
Do you think mmap(..., task_local_storage_map_fd, ...)
that returns a page that belongs to current task only is enough ?

If not we need to think through how to mmap local storage of other
tasks. One proposal was to use pgoff to carry the key somehow
like io-uring does, but if we want to generalize that the pgoff approach
falls apart if we want __mmapable_kptr to work like Martin is proposing above,
since the key will not fit in 64-bit of pgoff.

Maybe we need an office hours slot to discuss. This looks to be a big
topic. Not sure we can converge over email.
Just getting everyone on the same page will take a lot of email reading.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux