On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 11/20/23 10:11 PM, David Marchevsky wrote:
> >
> >
> > On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
> >> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
> >>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> >>> index 173ec7f43ed1..114973f925ea 100644
> >>> --- a/include/linux/bpf_local_storage.h
> >>> +++ b/include/linux/bpf_local_storage.h
> >>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> >>>     * the number of cachelines accessed during the cache hit case.
> >>>     */
> >>>    struct bpf_local_storage_map __rcu *smap;
> >>> -  u8 data[] __aligned(8);
> >>> +  /* Need to duplicate smap's map_flags as smap may be gone when
> >>> +   * it's time to free bpf_local_storage_data
> >>> +   */
> >>> +  u64 smap_map_flags;
> >>> +  /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> >>> +   * Otherwise the actual mapval data lives here
> >>> +   */
> >>> +  union {
> >>> +          DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> >>> +          void *actual_data __aligned(8);
> >>
> >> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
> >>
> >> Have you thought about allowing a kptr (pointing to some pages that can
> >> be mmap'ed later) to be stored as one of the members of the map's value?
> >> bpf_local_storage_map is one of the maps that supports kptr.
> >>
> >> struct normal_and_mmap_value {
> >>         int some_int;
> >>         int __percpu_kptr *some_cnts;
> >>
> >>         struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> struct mmap_only_value {
> >>         struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> [ ... ]
> >>
> >
> > This is an intriguing idea. For conciseness I'll call this specific kind
> > of kptr 'mmapable kptrs' for the rest of this message. Below is more of
> > a brainstorming dump than a cohesive response; separate trains of
> > thought are separated by two newlines.
>
> Thanks for bearing with me while some of these ideas could be crazy.
> I am trying to see how this would look for the other local storages, sk
> and inode. Allocating a page for each sk will not be nice for a server
> with half a million sk(s), e.g. half a million sk(s) sharing a few
> bandwidth policies or a few tuning parameters. Creating something
> mmap'able to the user space and also sharable among many sk(s) will be
> useful.
>
> >
> > My initial thought upon seeing struct normal_and_mmap_value was to note
> > that we currently don't support mmaping for map_value types with _any_
> > special fields ('special' as determined by btf_parse_fields). But IIUC
> > you're actually talking about exposing the some_stats pointee memory via
> > mmap, not the containing struct with kptr fields. That is, for maps that
> > support these kptrs, mmap()ing a map with value type struct
> > normal_and_mmap_value would return the some_stats pointer value, and
> > likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
> > logic in this patch. We'd only be able to support one such mmapable kptr
> > field per mapval type, but that isn't a dealbreaker.
> >
> > Some maps, like task_storage, would only support mmap() on a map_value
> > with an mmapable kptr field, as mmap()ing the mapval itself doesn't make
> > sense or is unsafe. Seems like arraymap would do the opposite, only
>
> Changing direction a bit since arraymap is brought up. :)
>
> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can
> store an arraymap as a kptr, the bpf prog should be able to access it as
> a map, more like the current map-in-map setup. The arraymap can be used
> as a regular map in the user space also (like pinning). It may need some
> btf plumbing to tell the value type of the arraymap to the verifier.
>
> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value,
> flags) can be used where value->array_mmap is initialized as an
> arraymap_fd.
> This will limit the arraymap kptr update to the syscall side only, which
> seems to be your usecase also? Allocating the arraymap from the bpf prog
> side needs some thought and needs a whitelist.
>
> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
> &task_pidfd, &value). The kernel can return a fd in value->array_mmap.
> Maybe we can create a libbpf helper to free the fd(s) resources held in
> the looked-up value by using the value's btf.
>
> The bpf_local_storage_map side probably does not need to support mmap()
> then.

Martin, that's an interesting idea! I kinda like it and I think it's worth
exploring further.

I think the main quirk of the proposed mmap-of-task-local-storage is using
the 'current' task as an implicit 'key' into the task local storage map.
It fits here, but I'm not sure it addresses the sched-ext use case.
Tejun, David, could you please chime in? Do you think
mmap(..., task_local_storage_map_fd, ...) that returns a page belonging to
the current task only is enough? If not, we need to think through how to
mmap the local storage of other tasks.

One proposal was to use pgoff to carry the key, somewhat like io_uring
does, but that approach falls apart if we generalize it to make
__mmapable_kptr work the way Martin is proposing above, since the key will
not fit in the 64 bits of pgoff.

Maybe we need an office hours slot to discuss. This looks to be a big
topic; I'm not sure we can converge over email. Just getting everyone on
the same page will take a lot of email reading.