On Thu, Jan 26, 2023 at 04:42:20PM -0800, Hao Luo wrote: > On Thu, Jan 26, 2023 at 11:11 AM David Vernet <void@xxxxxxxxxxxxx> wrote: > > > > Hi everyone, > > > > Another proposal from me for LSF/MM/BPF, and the last one for the time > > being. I'd like to discuss enabling local-storage maps (e.g. > > BPF_MAP_TYPE_TASK_STORAGE and BPF_MAP_TYPE_CGRP_STORAGE) to be r/o > > mapped directly into user space. This would allow for quick lookups of > > per-object state from user space, similar to how we allow it for > > BPF_MAP_TYPE_ARRAY, without having to do something like either of the > > following: > > > > - Allocating a statically sized BPF_MAP_TYPE_ARRAY which is >= the # of > > possible local-storage elements, which is likely wasteful in terms of > > memory, and which isn't easy to iterate over. > > > > - Use something like https://docs.kernel.org/bpf/bpf_iterators.html to > > iterate over tasks or cgroups, and collect information for each which > > is then dumped to user space. This would probably work, but it's not > > terribly performant in that it requires copying memory, trapping into > > the kernel, and full iteration even when it's only necessary to look > > up e.g. a single element. > > > > Designing and implementing this would be pretty non-trivial. We'd have > > to probably do a few things: > > > > 1. Write an allocator that dynamically allocates statically sized > > local-storage entries for local-storage maps, and populates them into > > pages which are mapped into user space. > > > > 2. Come up with some idr-like mechanism for mapping a local-storage > > object to an index into the mapping. For example, mapping a task with > > global pid 12345 to BPF_MAP_TYPE_TASK_STORAGE index 5, and providing > > ergonomic and safe ways to update these entries in the kernel and > > communicate them to user space. > > > > 3. Related to point 1 above, come up with some way to dynamically extend > > the user space mapping as more local-storage elements are added. We > > could potentially reserve a statically sized VA range and map all > > unused VA pages to the zero page, or instead possibly just leave them > > unmapped until they're actually needed. > > > > There are a lot of open questions, but I think it could be very useful > > if we can make it work. Let me know what you all think. > > > > Hi David, > > I remember, I had a similar idea and played with it last year. I don't > recall why I needed that feature back then, probably looking for ways > to pass per-task information from userspace and read it from within > BPF. I sent an RFC to the mailing list [1]. You could take a look, see > whether it is of help to you. > > [1] https://www.spinics.net/lists/bpf/msg57565.html Hi Hao, Thanks for sharing that thread, it's great to see that there is already interest from other folks. It looks like the main use case you were trying to enable was passing an fd from user space to a TLS element for the current task, which the BPF prog would then pass to helpers that take an fd. There was a need specifically to enable this for unprivileged programs which can't e.g. use bpf_prog_test_run to set the fd. Alexei proposed an alternative option in [0] which it seemed like everyone was on-board with. [0]: https://lore.kernel.org/bpf/20220329232956.gbsr65jdbe4lw2m6@ast-mbp/ The use-case I was envisioning is a bit different in a couple ways: - I was anticipating that user space could map an entire task (or cgroup, sk, etc) local-storage map, rather than a task only being able to mmap its own TLS entry. I think this approach would be more generalizable for other local-storage map types like cgroup and sk, and would also be useful for ghOSt, sched_ext, etc. It could also serve as a higher-performance alternative to bpf-iter for user space applications that don't want to have to trap into the kernel. - I was envisioning this being read-only, though I expect it would be possible to enable writeable mappings as well. The tricky part here is that we will eventually certainly want to enable referenced kptrs in local-storage maps, so it's not always safe to let user space mutate local-storage entries. - I'd like to avoid allocating an entire page for each entry. Most of the local-storage entries that I've used are far smaller than a page, so it seems prudent to have some kind of allocator that we could use to pack multiple entries into a single page. This would also have the benefit of potentially allowing an O(1) lookup for a map entry as well, rather than requiring us to do an O(n) iteration over a task's local storage entries when doing a lookup from a program. This is a super hand-wavey claim though -- a lot more details and validation need to be ironed out. What do you think? Thanks, David