David Vernet wrote: > On Thu, Nov 17, 2022 at 01:03:45PM -0800, John Fastabend wrote: > > David Vernet wrote: > > > Now that BPF supports adding new kernel functions with kfuncs, and > > > storing kernel objects in maps with kptrs, we can add a set of kfuncs > > > which allow struct task_struct objects to be stored in maps as > > > referenced kptrs. > > > > > > The possible use cases for doing this are plentiful. During tracing, > > > for example, it would be useful to be able to collect some tasks that > > > performed a certain operation, and then periodically summarize who they > > > are, which cgroup they're in, how much CPU time they've utilized, etc. > > > Doing this now would require storing the tasks' pids along with some > > > relevant data to be exported to user space, and later associating the > > > pids to tasks in other event handlers where the data is recorded. > > > Another useful by-product of this is that it allows a program to pin a > > > task in a BPF program, and by proxy therefore also e.g. pin its task > > > local storage. > > > > Sorry wasn't obvious to me (late to the party so if it was in some > > other v* described apologies). Can we say something about the life > > cycle of this acquired task_structs because they are incrementing > > the ref cnt on the task struct they have potential to impact system. > > We should probably add an entire docs page which describes how kptrs > work, and I am happy to do that (ideally in a follow-on patch set if > that's OK with you). In general I think it would be useful to include > docs for any general-purpose kfuncs like the ones proposed in this set. Sure, I wouldn't require that for your series though fwiw. > > In regards to your specific question about the task lifecycle, nothing > being proposed in this patch set differs from how kptr lifecycles work > in general. The idea is that the BPF program: > > 1. Gets a "kptr_ref" kptr from an "acquire" kfunc. > 2. Stores it in a map with bpf_kptr_xchg(). > > The program can then either later manually extract it from the map > (again with bpf_kptr_xchg()) and release it, or if the reference is > never removed from the map, let it be automatically released when the > map is destroyed. See [0] and [1] for a bit more information. Yep as long as the ref is decremented on map destroy and elem delete all good. > > [0]: https://docs.kernel.org/bpf/kfuncs.html?highlight=kptr#kf-acquire-flag > [1]: https://lwn.net/Articles/900749/ > > > I know at least we've had a few bugs in our task struct tracking > > that has led to various bugs where we leak references. In our case > > we didn't pin the kernel object so the leak is just BPF memory and > > user space memory, still sort of bad because we would hit memory > > limits and get OOM'd. Leaking kernel task structs is worse though. > > I don't think we need to worry about leaks. The verifier should fail to > load any program that doesn't properly release a kptr, and if it fails > to identify programs that improperly hold refcounts, that's a bug that > needs to be fixed. Similarly, if any map implementation (described > below) fails to properly free references at the appropriate time (when > an element or the map itself is destroyed), those are just bugs that > need to be fixed. > > I think the relevant tradeoff here is really the possible side effects > of keeping a task pinned and avoiding it being reaped. I agree that's an > important consideration, but I think that would arguably apply to any > kptr (modulo the size of the object being pinned, which is certainly > relevant as well). We already have kptrs for e.g. struct nf_conn [2]. > Granted, struct task_struct is significantly larger, but bpf_kptr_xchg() > is only enabled for privileged programs, so it seems like a reasonable > operation to allow. No not arguing it shouldn't be possible just didn't see the release hook. > > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/net/netfilter/nf_conntrack_bpf.c#n253 > > > quick question. If you put acquired task struct in a map what > > happens if user side deletes the entry? Presumably this causes the > > release to happen and the task_struct is good to go. Did I miss > > the logic? I was thinking you would have something in bpf_map_free_kptrs > > and a type callback to release() the refcnt? > > Someone else can chime in here to correct me if I'm wrong, but AFAIU > this is handled by the map implementations calling out to > bpf_obj_free_fields() to invoke the kptr destructor when the element is > destroyed. See [3] and [4] for examples of where they're called from the > arraymap and hashmap logic respectively. This is how the destructors are > similarly invoked when the maps are destroyed. Yep I found the dtor() gets populated in btf.c and apparently needed to repull my local tree because I missed it. Thanks for the detailed response. And last thing I was checking is because KF_SLEEPABLE is not set this should be blocked from running on sleepable progs which would break the call_rcu in the destructor. Maybe small nit, not sure its worth it but might be nice to annotate the helper description with a note, "will not work on sleepable progs" or something to that effect. Thanks. > > [3]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/arraymap.c#n431 > [4]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/kernel/bpf/hashtab.c#n764 > > [...]