On Wed, Aug 23, 2023 at 7:58 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Aug 22, 2023 at 10:42 PM David Marchevsky > <david.marchevsky@xxxxxxxxx> wrote: > > > > On 8/22/23 8:04 PM, Andrii Nakryiko wrote: > > > On Tue, Aug 22, 2023 at 12:20 PM David Marchevsky > > > <david.marchevsky@xxxxxxxxx> wrote: > > >> > > >> On 8/22/23 1:42 PM, Yonghong Song wrote: > > >>> > > >>> > > >>> On 8/21/23 10:05 PM, Dave Marchevsky wrote: > > >>>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > > >>>> creation and manipulation of struct bpf_iter_task_vma in open-coded > > >>>> iterator style. BPF programs can use these kfuncs directly or through > > >>>> bpf_for_each macro for natural-looking iteration of all task vmas. > > >>>> > > >>>> The implementation borrows heavily from bpf_find_vma helper's locking - > > >>>> differing only in that it holds the mmap_read lock for all iterations > > >>>> while the helper only executes its provided callback on a maximum of 1 > > >>>> vma. Aside from locking, struct vma_iterator and vma_next do all the > > >>>> heavy lifting. > > >>>> > > >>>> The newly-added struct bpf_iter_task_vma has a name collision with a > > >>>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > > >>>> file is renamed in order to avoid the collision. > > >>>> > > >>>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > > >>>> only field in struct bpf_iter_task_vma. This is because the inner data > > >>>> struct contains a struct vma_iterator (not ptr), whose size is likely to > > >>>> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > > >>>> such a change would require change in opaque bpf_iter_task_vma struct's > > >>>> size. So better to allocate vma_iterator using BPF allocator, and since > > >>>> that alloc must already succeed, might as well allocate all iter fields, > > >>>> thereby freezing struct bpf_iter_task_vma size. > > >>>> > > >>>> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx> > > >>>> Cc: Nathan Slingerland <slinger@xxxxxxxx> > > >>>> --- > > >>>> include/uapi/linux/bpf.h | 4 + > > >>>> kernel/bpf/helpers.c | 3 + > > >>>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > > >>>> tools/include/uapi/linux/bpf.h | 4 + > > >>>> tools/lib/bpf/bpf_helpers.h | 8 ++ > > >>>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > > >>>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > > >>>> 7 files changed, 116 insertions(+), 13 deletions(-) > > >>>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > >>>> > > >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >>>> index 8790b3962e4b..49fc1989a548 100644 > > >>>> --- a/include/uapi/linux/bpf.h > > >>>> +++ b/include/uapi/linux/bpf.h > > >>>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > > >>>> __u64 __opaque[1]; > > >>>> } __attribute__((aligned(8))); > > >>>> +struct bpf_iter_task_vma { > > >>>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > > >>>> +} __attribute__((aligned(8))); > > >>> > > >>> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > > >>> like > > >>> struct bpf_iter_<...> { > > >>> __u64 __opaque[1]; > > >>> } __attribute__((aligned(8))); > > >>> > > >>> Maybe we want a generic one instead of having lots of > > >>> structs with the same underline definition? For example, > > >>> struct bpf_iter_generic > > >>> ? > > >>> > > >> > > >> The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > > >> and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types > > >> of iters would break the scheme. We could: > > >> > > >> * Add bpf_for_each_generic that only uses bpf_iter_generic > > >> * This exposes implementation details in an ugly way, though. > > >> * Do some macro magic to pick bpf_iter_generic for some types of iters, and > > >> use consistent naming pattern for others. > > >> * I'm not sure how to do this with preprocessor > > >> * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > > >> data struct, and use bpf_iter_generic for all of them > > >> * Probably need to see more iter implementation / usage before making such > > >> a change > > >> * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > > >> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > > >> logic. Could do similar typedef w/ struct to try to work around > > >> it. > > >> > > >> Let me know what you think. Personally I considered doing typedef while > > >> implementing this, so that's the alternative I'd choose. > > >> > > >>>> + > > >>>> #endif /* _UAPI__LINUX_BPF_H__ */ > > >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > >>>> index eb91cae0612a..7a06dea749f1 100644 > > >>>> --- a/kernel/bpf/helpers.c > > >>>> +++ b/kernel/bpf/helpers.c > > >>>> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > > >>>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > > >>>> index c4ab9d6cdbe9..51c2dce435c1 100644 > > >>>> --- a/kernel/bpf/task_iter.c > > >>>> +++ b/kernel/bpf/task_iter.c > > >>>> @@ -7,7 +7,9 @@ > > >>>> #include <linux/fs.h> > > >>>> #include <linux/fdtable.h> > > >>>> #include <linux/filter.h> > > >>>> +#include <linux/bpf_mem_alloc.h> > > >>>> #include <linux/btf_ids.h> > > >>>> +#include <linux/mm_types.h> > > >>>> #include "mmap_unlock_work.h" > > >>>> static const char * const iter_task_type_names[] = { > > >>>> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > > >>>> .arg5_type = ARG_ANYTHING, > > >>>> }; > > >>>> +struct bpf_iter_task_vma_kern_data { > > >>>> + struct task_struct *task; > > >>>> + struct mm_struct *mm; > > >>>> + struct mmap_unlock_irq_work *work; > > >>>> + struct vma_iterator vmi; > > >>>> +}; > > >>>> + > > >>>> +/* Non-opaque version of uapi bpf_iter_task_vma */ > > >>>> +struct bpf_iter_task_vma_kern { > > >>>> + struct bpf_iter_task_vma_kern_data *data; > > >>>> +} __attribute__((aligned(8))); > > >>>> + > > >>>> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > > >>>> + struct task_struct *task, u64 addr) > > >>>> +{ > > >>>> + struct bpf_iter_task_vma_kern *kit = (void *)it; > > >>>> + bool irq_work_busy = false; > > >>>> + int err; > > >>>> + > > >>>> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); > > >>>> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); > > >>>> + > > >>>> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized > > >>>> + * before, so non-NULL kit->data doesn't point to previously > > >>>> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data > > >>>> + */ > > >>>> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); > > >>>> + if (!kit->data) > > >>>> + return -ENOMEM; > > >>>> + kit->data->task = NULL; > > >>>> + > > >>>> + if (!task) { > > >>>> + err = -ENOENT; > > >>>> + goto err_cleanup_iter; > > >>>> + } > > >>>> + > > >>>> + kit->data->task = get_task_struct(task); > > >>> > > >>> The above is not safe. Since there is no restriction on 'task', > > >>> the 'task' could be in a state for destruction with 'usage' count 0 > > >>> and then get_task_struct(task) won't work since it unconditionally > > >>> tries to increase 'usage' count from 0 to 1. > > >>> > > >>> Or, 'task' may be valid at the entry of the funciton, but when > > >>> 'task' is in get_task_struct(), 'task' may have been destroyed > > >>> and 'task' memory is reused by somebody else. > > >>> > > >>> I suggest that we check input parameter 'task' must be > > >>> PTR_TRUSTED or MEM_RCU. This way, the above !task checking > > >>> is not necessary and get_task_struct() can correctly > > >>> hold a reference to 'task'. > > >>> > > >> > > >> Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious > > >> whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID > > >> to this kfunc currently. > > >> > > >> * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID > > >> * ptr hop from trusted task_struct to 'real_parent' or similar should > > >> yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def > > >> * if task kptr is in map_val, direct reference to it should result > > >> in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again > > >> using bpf_task_from_pid (?) > > >> > > >> But regardless, better to be explicit. Will change. > > > > > > How horrible would it be to base an interface on TID/PID (i.e., int) > > > as input argument to specify a task? I'm just thinking it might be > > > more generic and easy to use in more situations: > > > - for all the cases where we have struct task_struct, getting its > > > pid is trivial: `task->pid`; > > > - but in some situations PID might be coming from outside: either > > > as an argument to CLI tool, or from old-style tracepoint (e.g., > > > context_switch where we have prev/next task pid), etc. > > > > > > The downside is that we'd need to look up a task, right? But on the > > > other hand we get more generality and won't have to rely on having > > > PTR_TRUSTED task_struct. > > > > > > Thoughts? > > > > > > > Meh, taking tid here feels like the 'old-school' approach, before recent > > efforts to teach the verifier more about resource acquisition, locking, > > iteration, trustedness, etc. All allowing us to push more important logic > > out of 'opaque' helper impl and into BPF programs. > > > > In this tid -> struct task_struct case I think the provided resource acquisition > > is sufficient. Using your examples: > > > > * We have a TRUSTED or RCU struct task_struct > > * No need to do anything, can pass to bpf_iter_task_vma_new > > > > * We have a struct task_struct, but it's UNTRUSTED or has no trustedness > > type flags > > * Use bpf_task_acquire or bpf_task_from_pid > > > > * We just have a pid ('coming from outside') > > * Use bpf_task_from_pid > > > > If there is some scenario where we can't get from pid to properly acquired task > > in the BPF program, let's improve the resource acquisition instead of pushing > > it into the kfunc impl. > > > > Also, should we look up (and refcount incr) the task using task->rcu_users > > refcount w/ bpf_task_{acquire,release}, or using task->usage refcount w/ > > {get,put}_task_struct ? More generally, if there are >1 valid ways to acquire > > task_struct or some other resource, pushing acquisition to the BPF program > > gives us the benefit of not having to pick one (or do possibly ugly / complex > > flags). As long as type flags express that the resource will not go away, > > this kfunc impl can ignore the details of how that property came about. > > Agree with above reasoning. pid->task should be separate set of kfunc. > It's not a trivial task either. there are pid namespaces. there is u32 pid > and there is 'struct pid'. u32_pid->struct_pid->task has to be designed first. > Probably in some case the trusted task pointer is ready to be accessed, > but context could be such that pid->task cannot be done. makes sense, having bpf_task_from_pid() removes my concerns, I forgot about it.