Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 7, 2024 at 11:10 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Andrii Nakryiko <andrii@xxxxxxxxxx> [240503 20:30]:
> > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > involving figuring out process memory layout, what files are backing any
> > given memory range, etc. One important class of applications that
> > absolutely rely on this are profilers/stack symbolizers. They would
> > normally capture stack trace containing absolute memory addresses of
> > some functions, and would then use /proc/<pid>/maps file to file
> > corresponding backing ELF files, file offsets within them, and then
> > continue from there to get yet more information (ELF symbols, DWARF
> > information) to get human-readable symbolic information.
> >
> > As such, there are both performance and correctness requirement
> > involved. This address to VMA information translation has to be done as
> > efficiently as possible, but also not miss any VMA (especially in the
> > case of loading/unloading shared libraries).
> >
> > Unfortunately, for all the /proc/<pid>/maps file universality and
> > usefulness, it doesn't fit the above 100%.
> >
> > First, it's text based, which makes its programmatic use from
> > applications and libraries unnecessarily cumbersome and slow due to the
> > need to do text parsing to get necessary pieces of information.
> >
> > Second, it's main purpose is to emit all VMAs sequentially, but in
> > practice captured addresses would fall only into a small subset of all
> > process' VMAs, mainly containing executable text. Yet, library would
> > need to parse most or all of the contents to find needed VMAs, as there
> > is no way to skip VMAs that are of no use. Efficient library can do the
> > linear pass and it is still relatively efficient, but it's definitely an
> > overhead that can be avoided, if there was a way to do more targeted
> > querying of the relevant VMA information.
> >
> > Another problem when writing generic stack trace symbolization library
> > is an unfortunate performance-vs-correctness tradeoff that needs to be
> > made. Library has to make a decision to either cache parsed contents of
> > /proc/<pid>/maps for service future requests (if application requests to
> > symbolize another set of addresses, captured at some later time, which
> > is typical for periodic/continuous profiling cases) to avoid higher
> > costs of needed to re-parse this file or caching the contents in memory
> > to speed up future requests. In the former case, more memory is used for
> > the cache and there is a risk of getting stale data if application
> > loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> > through additiona mmap() calls (and other means of altering memory
> > address space). In the latter case, it's the performance hit that comes
> > from re-opening the file and re-reading/re-parsing its contents all over
> > again.
> >
> > This patch aims to solve this problem by providing a new API built on
> > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> > interface, avoiding the cost and awkwardness of textual representation
> > for programmatic use. It's designed to be extensible and
> > forward/backward compatible by including user-specified field size and
> > using copy_struct_from_user() approach. But, most importantly, it allows
> > to do point queries for specific single address, specified by user. And
> > this is done efficiently using VMA iterator.
> >
> > User has a choice to pick either getting VMA that covers provided
> > address or -ENOENT if none is found (exact, least surprising, case). Or,
> > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> > get either VMA that covers the address (if there is one), or the closest
> > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows
> > more efficient use, but, given it could be a surprising behavior,
> > requires an explicit opt-in.
> >
> > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> > sense given it's querying the same set of VMA data. All the permissions
> > checks performed on /proc/<pid>/maps opening fit here as well.
> > ioctl-based implementation is fetching remembered mm_struct reference,
> > but otherwise doesn't interfere with seq_file-based implementation of
> > /proc/<pid>/maps textual interface, and so could be used together or
> > independently without paying any price for that.
> >
> > There is one extra thing that /proc/<pid>/maps doesn't currently
> > provide, and that's an ability to fetch ELF build ID, if present. User
> > has control over whether this piece of information is requested or not
> > by either setting build_id_size field to zero or non-zero maximum buffer
> > size they provided through build_id_addr field (which encodes user
> > pointer as __u64 field).
> >
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this,
> > requiring applications to open underlying ELF binary through
> > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra
> > permissions implications due giving a full access to the binary from
> > (potentially) another process, while all application is interested in is
> > build ID. Giving an ability to request just build ID doesn't introduce
> > any additional security concerns, on top of what /proc/<pid>/maps is
> > already concerned with, simplifying the overall logic.
> >
> > Kernel already implements build ID fetching, which is used from BPF
> > subsystem. We are reusing this code here, but plan a follow up changes
> > to make it work better under more relaxed assumption (compared to what
> > existing code assumes) of being called from user process context, in
> > which page faults are allowed. BPF-specific implementation currently
> > bails out if necessary part of ELF file is not paged in, all due to
> > extra BPF-specific restrictions (like the need to fetch build ID in
> > restrictive contexts such as NMI handler).
> >
> > Note also, that fetching VMA name (e.g., backing file path, or special
> > hard-coded or user-provided names) is optional just like build ID. If
> > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > it, saving resources.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/fs.h |  32 ++++++++
> >  2 files changed, 197 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 8e503a1635b7..cb7b1ff1a144 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/pkeys.h>
> >  #include <linux/minmax.h>
> >  #include <linux/overflow.h>
> > +#include <linux/buildid.h>
> >
> >  #include <asm/elf.h>
> >  #include <asm/tlb.h>
> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> >       return do_maps_open(inode, file, &proc_pid_maps_op);
> >  }
> >
> > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > +{
> > +     struct procfs_procmap_query karg;
> > +     struct vma_iterator iter;
> > +     struct vm_area_struct *vma;
> > +     struct mm_struct *mm;
> > +     const char *name = NULL;
> > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > +     __u64 usize;
> > +     int err;
> > +
> > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > +             return -EFAULT;
> > +     if (usize > PAGE_SIZE)
> > +             return -E2BIG;
> > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > +             return -EINVAL;
> > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > +     if (err)
> > +             return err;
> > +
> > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > +             return -EINVAL;
> > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > +             return -EINVAL;
> > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > +             return -EINVAL;
> > +
> > +     mm = priv->mm;
> > +     if (!mm || !mmget_not_zero(mm))
> > +             return -ESRCH;
> > +     if (mmap_read_lock_killable(mm)) {
> > +             mmput(mm);
> > +             return -EINTR;
> > +     }
>
> Using the rcu lookup here will allow for more success rate with less
> lock contention.
>

If you have any code pointers, I'd appreciate it. If not, I'll try to
find it myself, no worries.

> > +
> > +     vma_iter_init(&iter, mm, karg.query_addr);
> > +     vma = vma_next(&iter);
> > +     if (!vma) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
> > +     /* user wants covering VMA, not the closest next one */
> > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > +         vma->vm_start > karg.query_addr) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
>
> The interface you are using is a start address to search from to the end
> of the address space, so this won't work as you intended with the
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag.  I do not think the vma iterator

Maybe the name isn't the best, by "EXACT" here I meant "VMA that
exactly covers provided address", so maybe "COVERING_OR_NEXT_VMA"
would be better wording.

With that out of the way, I think this API works exactly how I expect
it to work:

# cat /proc/3406/maps | grep -C1 7f42099fe000
7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0
7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

# cat addrs.txt
0x7f42099fe010

# ./procfs_query -f addrs.txt -p 3406 -v -Q
PID: 3406
PATH: addrs.txt
READ 1 addrs!
SORTED ADDRS (1):
ADDR #0: 0x7f42099fe010
VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000
00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
(build ID: NO, 0 bytes)
RESOLVED ADDRS (1):
RESOLVED   #0: 0x7f42099fe010 -> OFF 0x2010 NAME
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

You can see above that for the requested 0x7f42099fe010 address we got
a VMA that starts before this address: 7f42099fe000-7f4209a0e000,
which is what we want.

Before submitting I ran the tool with /proc/<pid>/maps and ioctl to
"resolve" the exact same set of addresses and I compared results. They
were identical.


Note, there is a small bug in the tool I added in patch #5. I changed
`-i` argument to `-Q` at the very last moment and haven't updated the
code in one place. But other than that I didn't change anything. For
the above output, I added "VMA FOUND" verbose logging to see all the
details of VMA, not just resolved offset. I'll add that in v2.

> has the desired interface you want as the single address lookup doesn't
> use the vma iterator.  I'd just run the vma_next() and check the limits.
> See find_exact_vma() for the limit checks.
>
> > +
> > +     karg.vma_start = vma->vm_start;
> > +     karg.vma_end = vma->vm_end;
> > +

[...]





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux