On 11/21/23 2:49 PM, Alexei Starovoitov wrote: > On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: >> >> On 11/20/23 10:11 PM, David Marchevsky wrote: >>> >>> >>> On 11/20/23 7:42 PM, Martin KaFai Lau wrote: >>>> On 11/20/23 9:59 AM, Dave Marchevsky wrote: >>>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h >>>>> index 173ec7f43ed1..114973f925ea 100644 >>>>> --- a/include/linux/bpf_local_storage.h >>>>> +++ b/include/linux/bpf_local_storage.h >>>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data { >>>>> * the number of cachelines accessed during the cache hit case. >>>>> */ >>>>> struct bpf_local_storage_map __rcu *smap; >>>>> - u8 data[] __aligned(8); >>>>> + /* Need to duplicate smap's map_flags as smap may be gone when >>>>> + * it's time to free bpf_local_storage_data >>>>> + */ >>>>> + u64 smap_map_flags; >>>>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data >>>>> + * Otherwise the actual mapval data lives here >>>>> + */ >>>>> + union { >>>>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8); >>>>> + void *actual_data __aligned(8); >>>> >>>> The pages (that can be mmap'ed later) feel like a specific kind of kptr. >>>> >>>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr. >>>> >>>> struct normal_and_mmap_value { >>>> int some_int; >>>> int __percpu_kptr *some_cnts; >>>> >>>> struct bpf_mmap_page __kptr *some_stats; >>>> }; >>>> >>>> struct mmap_only_value { >>>> struct bpf_mmap_page __kptr *some_stats; >>>> }; >>>> >>>> [ ... ] >>>> >>> >>> This is an intriguing idea. For conciseness I'll call this specific >>> kind of kptr 'mmapable kptrs' for the rest of this message. Below is >>> more of a brainstorming dump than a cohesive response, separate trains >>> of thought are separated by two newlines. >> >> Thanks for bearing with me while some ideas could be crazy. I am trying to see >> how this would look like for other local storage, sk and inode. Allocating a >> page for each sk will not be nice for server with half a million sk(s). e.g. >> half a million sk(s) sharing a few bandwidth policies or a few tuning >> parameters. Creating something mmap'able to the user space and also sharable >> among many sk(s) will be useful. >> >>> >>> >>> My initial thought upon seeing struct normal_and_mmap_value was to note >>> that we currently don't support mmaping for map_value types with _any_ >>> special fields ('special' as determined by btf_parse_fields). But IIUC >>> you're actually talking about exposing the some_stats pointee memory via >>> mmap, not the containing struct with kptr fields. That is, for maps that >>> support these kptrs, mmap()ing a map with value type struct >>> normal_and_mmap_value would return the some_stats pointer value, and >>> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE >>> logic in this patch. We'd only be able to support one such mmapable kptr >>> field per mapval type, but that isn't a dealbreaker. >>> >>> Some maps, like task_storage, would only support mmap() on a map_value >>> with mmapable kptr field, as mmap()ing the mapval itself doesn't make >>> sense or is unsafe. Seems like arraymap would do the opposite, only >> >> Changing direction a bit since arraymap is brought up. :) >> >> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an >> arraymap as kptr, the bpf prog should be able to access it as a map. More like >> the current map-in-map setup. The arraymap can be used as regular map in the >> user space also (like pinning). It may need some btf plumbing to tell the value >> type of the arrayamp to the verifier. >> >> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags) >> can be used where the value->array_mmap initialized as an arraymap_fd. This will >> limit the arraymap kptr update only from the syscall side which seems to be your >> usecase also? Allocating the arraymap from the bpf prog side needs some thoughts >> and need a whitelist. >> >> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd, >> &task_pidfd, &value). The kernel can return a fd in value->array_mmap. May be we >> can create a libbpf helper to free the fd(s) resources held in the looked-up >> value by using the value's btf. >> >> The bpf_local_storage_map side probably does not need to support mmap() then. > > Martin, > that's an interesting idea! > I kinda like it and I think it's worth exploring further. > > I think the main quirk of the proposed mmap-of-task-local-storage > is using 'current' task as an implicit 'key' in task local storage map. > It fits here, but I'm not sure it addresses sched-ext use case. > > Tejun, David, > could you please chime in ? > Do you think mmap(..., task_local_storage_map_fd, ...) > that returns a page that belongs to current task only is enough ? > > If not we need to think through how to mmap local storage of other > tasks. One proposal was to use pgoff to carry the key somehow > like io-uring does, but if we want to generalize that the pgoff approach > falls apart if we want __mmapable_kptr to work like Martin is proposing above, > since the key will not fit in 64-bit of pgoff. > > Maybe we need an office hours slot to discuss. This looks to be a big > topic. Not sure we can converge over email. > Just getting everyone on the same page will take a lot of email reading. Meta BPF folks were all in one place for reasons unrelated to this thread, and decided to have a design discussion regarding this mmapable task_local storage implementation and other implementation ideas discussed in this thread. I will summarize the discussion below. We didn't arrive at a confident conclusion, though, so plenty of time for others to chime in and for proper office hours discussion to happen if necessary. Below, anything attributed to a specific person is paraphrased from my notes, there will certainly be errors / misrememberings. mmapable task_local storage design discussion 11/29 We first summarized approaches that were discussed in this thread: 1) Current implementation in this series 2) pseudo-map_in_map, kptr arraymap type: struct mmapable_data_map { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(map_flags, BPF_F_MMAPABLE); __uint(max_entries, 1); __type(key, __u32); __type(value, __u64); }; struct my_mapval { int whatever; struct bpf_arraymap __kptr __arraymap_type(mmapable_data_map) *m; /* Need special logic to support getting / updating above field from userspace (as fd) */ }; 3) "mmapable page" kptr type, or "mmapable kptr" tag struct my_mapval { int whatever; struct bpf_mmapable_page *m; }; struct my_mapval2 { /* Separate approach than my_mapval above */ struct bpf_spin_lock l; int some_int; struct my_type __mmapable_kptr *some_stats; }; Points of consideration regardless of implementation approach: * mmap() syscall's return address must be page-aligned. If we want to reduce / eliminate wasted memory by supporting packing of multiple mapvals onto single page, need to be able to return non-page-aligned addr. Using a BPF syscall subcommand in lieu of or in addition to mmap() syscall would be a way to support this. * Dave wants to avoid implementing packing at all, but having a reasonable path forward would be nice in case actual usecase arises. * Andrii suggested a new actual mmap syscall supporting passing of custom params, useful for any subsystem using mmap in nontraditional ways. This was initially a response to "use offset to pass task id" discussion re: selecting non-current task. * How orthogonal is Martin's "kptr to arraymap" suggestion from the general mmapable local_storage goal? Is there a world where we choose a different approach for this work, and then to "kptr to arraymap" independently later? The above was mostly summary of existing discussion in this thread. Rest of discussion flowed from there. Q&A: - Do we need to be able to query other tasks' mapvals? (for David Vernet / Tejun Heo) TJ had a usecase where this might've been necessary, but rewrote. David: Being able to do this in general, aside from TJ's specific case, would be useful. David provided an example from ghOSt project - central scheduling. Another example: Folly runtime framework, farms out work to worker threads, might want to tag them. - Which usecases actually care about avoiding syscall overhead? Alexei: Most services that would want to use this functionality to tag their tasks don't need it, as they just set the value once. Dave: Some tracing usecases (e.g. strobemeta) need it. - Do we want to use mmap() in current-task-only limited way, or do BPF subcommand or something more exotic? TJ: What if bpf subcommand returns FD that can be mmap'd. fd identifies which task_local storage elem is mmap'd. Subcommand: bpf_map_lookup_elem_fd(map *, u64 elem_id). Alexei: Such a thing should return fd to arbitrary mapval, and should support other common ops (open, close, etc.). David: What's the problem w/ having fd that only supports mmap for now? TJ: 'Dangling' fds exist in some usecases already Discussion around the bpf_map_lookup_elem_fd idea continued for quite a while. Folks liked that it avoids the "can only have one mmapable field" issue from proposal (3) above, without making any implicit assumptions. Alexei: Can we instead have pointer to userspace blob - similar to rseq - to avoid wasting most of page? TJ: My instinct is to stick to something more generic, would rather pay 4k. Discussion around userspace pointer continued for a while as well, ending in the conclusion that we should take a look at using get_user_pages, perhaps wrapping such functionality in a 'guppable' kptr type. Folks liked the 'guppable' idea as it sort-of avoids the wasted memory issue, pushing the details to userspace, and punts on working out a path forward for mmap interface, which other implementation ideas require. Action items based on convo, priority order: - Think more about / prototype 'guppable' kptr idea - If the above has issues, try bpf_map_lookup_elem_fd - If both above have issues, consider earlier approaches I will start tackling these soon.