Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper

David Marchevsky <david.marchevsky@xxxxxxxxx> · Fri, 4 Aug 2023 02:59:13 -0400

On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>
>> At Meta we have a profiling daemon which periodically collects
>> information on many hosts. This collection usually involves grabbing
>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>> those cases we must fall back to digging around in /proc/PID/maps to map
>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>> occur synchronously with stack collection, so the process might already
>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>> symbolicate.
>>
>> This 'exited process problem' doesn't occur very often as
>> most of the prod services we care to profile are long-lived daemons,
>> there are enough usecases to warrant a workaround: a BPF program which
>> can be optionally loaded at data collection time and essentially walks
>> /proc/PID/maps. Currently this is done by walking the vma list:
>>
>>   struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>>   mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */
>>
>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>> longer a vma linked list to walk. Walking the vma maple tree is not as
>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>> returns the vma containing addr, or if no vma contains addr,
>> the closest vma with higher start addr.
>>
>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>> find_vma, with the major difference that no 'closest vma' is returned if
>> there is no VMA containing a particular address. This prevents BPF
>> programs from being able to use bpf_find_vma to iterate all vmas in a
>> task in a reasonable way.
>>
>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>> 'closest vma' behavior when used. Because this is find_vma's default
>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>> on find_vma retval.
>>
>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>> 'start'. The former is used in documentation and more accurately
>> describes the param.
>>
>> [
>>   RFC: This isn't an ideal solution for iteration of all vmas in a task
>>        in the long term for a few reasons:
>>
>>      * In nmi context, second call to bpf_find_vma will fail because
>>        irq_work is busy, so can't iterate all vmas
>>      * Repeatedly taking and releasing mmap_read lock when a dedicated
>>        iterate_all_vmas(task) kfunc could just take it once and hold for
>>        all vmas
>>
>>     My specific usecase doesn't do vma iteration in nmi context and I
>>     think the 'closest vma' behavior can be useful here despite locking
>>     inefficiencies.
>>
>>     When Alexei and I discussed this offline, two alternatives to
>>     provide similar functionality while addressing above issues seemed
>>     reasonable:
>>
>>       * open-coded iterator for task vma. Similar to existing
>>         task_vma bpf_iter, but no need to create a bpf_link and read
>>         bpf_iter fd from userspace.
>>       * New kfunc taking callback similar bpf_find_vma, but iterating
>>         over all vmas in one go
>>
>>      I think this patch is useful on its own since it's a fairly minimal
>>      change and fixes my usecase. Sending for early feedback and to
>>      solicit further thought about whether this should be dropped in
>>      favor of one of the above options.
> 
> - In theory this patch can work, but patch 2 didn't attempt to actually
> use it in a loop to iterate all vma-s.
> Which is a bit of red flag whether such iteration is practical
> (either via bpf_loop or bpf_for).
> 
> - This behavior of bpf_find_vma() feels too much implementation detail.
> find_vma will probably stay this way, since different parts of the kernel
> rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.
> 
> - Looking at task_vma_seq_get_next().. that's how vma iter should be done and
> I don't think bpf prog can do it on its own.
> Because with bpf_find_vma() the lock will drop at every step the problems
> described at that large comment will be hit sooner or later.
> 
> All concerns combined I feel we better provide a new kfunc that iterates vma
> and drops the lock before invoking callback.
> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
> Maybe it's ok.
> Doing it open coded iterators style is likely better.
> bpf_iter_vma_new() kfunc will do
> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
> 
> I'd try to do open-code-iter first. It's a good test for the iter infra.
> bpf_iter_testmod_seq_new is an example of how to add a new iter.
> 
> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
> It's not a trusted arg. We better move away from this legacy pointer.
> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
> fwiw bpf_get_current_task_btf_proto has
> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
> The bpf prog might look like:
> task = bpf_get_current_task_btf();
> err = bpf_iter_vma_new(&it, task);
> while ((vma = bpf_iter_vma_next(&it))) ...;
> assuming lock is not dropped by _next.

The only concern here that doesn't seem reasonable to me is the
"too much implementation detail". I agree with the rest, though,
so will send a different series with new implementation and point
 to this discussion.