Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> · Mon, 6 May 2024 15:53:40 -0300

On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.

> > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

> > > > Where is the userspace code that uses this new api you have created?

> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.

> I think the general use case in perf is different.  This ioctl API is great
> for live tracing of a single (or a small number of) process(es).  And
> yes, perf tools have those tracing use cases too.  But I think the
> major use case of perf tools is system-wide profiling.

> For system-wide profiling, you need to process samples of many
> different processes at a high frequency.  Now perf record doesn't
> process them and just save it for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).

Since:

  Author: Jiri Olsa <jolsa@xxxxxxxxxx>
  Date:   Mon Dec 14 11:54:49 2020 +0100
  1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")

We don't need to to process the events to find the build ids. I haven't
checked if we still do it to find out which DSOs had hits, but we
shouldn't need to do it for build-ids (unless they were not in memory
when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
haven't checked but IIRC is a possibility if that ELF part isn't in
memory at the time we want to copy it).

If we're still traversing it like that I guess we can have a knob and
make it the default to not do that and instead create the perf.data
build ID header table with all the build-ids we got from
PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
processing at the end of a 'perf record' session.

> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling.  And we cannot move processing

It comes in the PERF_RECORD_MMAP2, filled by the kernel.

> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

right

> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

right

> So it saves the memory map of processes and symbolizes
> the stack trace with it later.  Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.

> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.

> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yes, I wonder how far Jiri went, Jiri?

- Arnaldo

> Thanks,
> Namhyung
> 
> 
> >
> > At some point, when BPF iterators became a thing we thought about, IIRC
> > Jiri did some experimentation, but I lost track, of using BPF to
> > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > as in uapi/linux/perf_event.h:
> >
> >         /*
> >          * The MMAP2 records are an augmented version of MMAP, they add
> >          * maj, min, ino numbers to be used to uniquely identify each mapping
> >          *
> >          * struct {
> >          *      struct perf_event_header        header;
> >          *
> >          *      u32                             pid, tid;
> >          *      u64                             addr;
> >          *      u64                             len;
> >          *      u64                             pgoff;
> >          *      union {
> >          *              struct {
> >          *                      u32             maj;
> >          *                      u32             min;
> >          *                      u64             ino;
> >          *                      u64             ino_generation;
> >          *              };
> >          *              struct {
> >          *                      u8              build_id_size;
> >          *                      u8              __reserved_1;
> >          *                      u16             __reserved_2;
> >          *                      u8              build_id[20];
> >          *              };
> >          *      };
> >          *      u32                             prot, flags;
> >          *      char                            filename[];
> >          *      struct sample_id                sample_id;
> >          * };
> >          */
> >         PERF_RECORD_MMAP2                       = 10,
> >
> >  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
> > setup a meta data perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> >  Usage: perf record [<options>] [<command>]
> >     or: perf record [<options>] -- <command> [<options>]
> >
> >         --num-thread-synthesize <n>
> >                           number of threads to run for event synthesis
> >         --synth <no|all|task|mmap|cgroup>
> >                           Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
> > >
> > >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
> > > >
> > > > > ---
> > > > >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/fs.h |  32 ++++++++
> > > > >  2 files changed, 197 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -22,6 +22,7 @@
> > > > >  #include <linux/pkeys.h>
> > > > >  #include <linux/minmax.h>
> > > > >  #include <linux/overflow.h>
> > > > > +#include <linux/buildid.h>
> > > > >
> > > > >  #include <asm/elf.h>
> > > > >  #include <asm/tlb.h>
> > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > >       return do_maps_open(inode, file, &proc_pid_maps_op);
> > > > >  }
> > > > >
> > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > > +{
> > > > > +     struct procfs_procmap_query karg;
> > > > > +     struct vma_iterator iter;
> > > > > +     struct vm_area_struct *vma;
> > > > > +     struct mm_struct *mm;
> > > > > +     const char *name = NULL;
> > > > > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > > +     __u64 usize;
> > > > > +     int err;
> > > > > +
> > > > > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > > +             return -EFAULT;
> > > > > +     if (usize > PAGE_SIZE)
> > > >
> > > > Nice, where did you document that?  And how is that portable given that
> > > > PAGE_SIZE can be different on different systems?
> > >
> > > I'm happy to document everything, can you please help by pointing
> > > where this documentation has to live?
> > >
> > > This is mostly fool-proofing, though, because the user has to pass
> > > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > > close to even 4KB (not even saying about 64KB). This is just to
> > > prevent copy_struct_from_user() below to do too much zero-checking.
> > >
> > > >
> > > > and why aren't you checking the actual structure size instead?  You can
> > > > easily run off the end here without knowing it.
> > >
> > > See copy_struct_from_user(), it does more checks. This is a helper
> > > designed specifically to deal with use cases like this where kernel
> > > struct size can change and user space might be newer or older.
> > > copy_struct_from_user() has a nice documentation describing all these
> > > nuances.
> > >
> > > >
> > > > > +             return -E2BIG;
> > > > > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > > +             return -EINVAL;
> > > >
> > > > Ok, so you have two checks?  How can the first one ever fail?
> > >
> > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > > won't fail, but this one will fail.
> > >
> > > The point of this check is that user has to specify at least first
> > > three fields of procfs_procmap_query (size, query_flags, and
> > > query_addr), because without those the query is meaningless.
> > > >
> > > >
> > > > > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > >
> > > and this helper does more checks validating that the user either has a
> > > shorter struct (and then zero-fills the rest of kernel-side struct) or
> > > has longer (and then the longer part has to be zero filled). Do check
> > > copy_struct_from_user() documentation, it's great.
> > >
> > > > > +     if (err)
> > > > > +             return err;
> > > > > +
> > > > > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > > +             return -EINVAL;
> > > > > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > > +             return -EINVAL;
> > > > > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > > +             return -EINVAL;
> > > >
> > > > So you want values to be set, right?
> > >
> > > Either both should be set, or neither. It's ok for both size/addr
> > > fields to be zero, in which case it indicates that the user doesn't
> > > want this part of information (which is usually a bit more expensive
> > > to get and might not be necessary for all the cases).
> > >
> > > >
> > > > > +
> > > > > +     mm = priv->mm;
> > > > > +     if (!mm || !mmget_not_zero(mm))
> > > > > +             return -ESRCH;
> > > >
> > > > What is this error for?  Where is this documentned?
> > >
> > > I copied it from existing /proc/<pid>/maps checks. I presume it's
> > > guarding the case when mm might be already put. So if the process is
> > > gone, but we have /proc/<pid>/maps file open?
> > >
> > > >
> > > > > +     if (mmap_read_lock_killable(mm)) {
> > > > > +             mmput(mm);
> > > > > +             return -EINTR;
> > > > > +     }
> > > > > +
> > > > > +     vma_iter_init(&iter, mm, karg.query_addr);
> > > > > +     vma = vma_next(&iter);
> > > > > +     if (!vma) {
> > > > > +             err = -ENOENT;
> > > > > +             goto out;
> > > > > +     }
> > > > > +     /* user wants covering VMA, not the closest next one */
> > > > > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > > +         vma->vm_start > karg.query_addr) {
> > > > > +             err = -ENOENT;
> > > > > +             goto out;
> > > > > +     }
> > > > > +
> > > > > +     karg.vma_start = vma->vm_start;
> > > > > +     karg.vma_end = vma->vm_end;
> > > > > +
> > > > > +     if (vma->vm_file) {
> > > > > +             const struct inode *inode = file_user_inode(vma->vm_file);
> > > > > +
> > > > > +             karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > > +             karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > > +             karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > > >
> > > > So the major/minor is that of the file superblock?  Why?
> > >
> > > Because inode number is unique only within given super block (and even
> > > then it's more complicated, e.g., btrfs subvolumes add more headaches,
> > > I believe). inode + dev maj/min is sometimes used for cache/reuse of
> > > per-binary information (e.g., pre-processed DWARF information, which
> > > is *very* expensive, so anything that allows to avoid doing this is
> > > helpful).
> > >
> > > >
> > > > > +             karg.inode = inode->i_ino;
> > > >
> > > > What is userspace going to do with this?
> > > >
> > >
> > > See above.
> > >
> > > > > +     } else {
> > > > > +             karg.vma_offset = 0;
> > > > > +             karg.dev_major = 0;
> > > > > +             karg.dev_minor = 0;
> > > > > +             karg.inode = 0;
> > > >
> > > > Why not set everything to 0 up above at the beginning so you never miss
> > > > anything, and you don't miss any holes accidentally in the future.
> > > >
> > >
> > > Stylistic preference, I find this more explicit, but I don't care much
> > > one way or another.
> > >
> > > > > +     }
> > > > > +
> > > > > +     karg.vma_flags = 0;
> > > > > +     if (vma->vm_flags & VM_READ)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > > > +     if (vma->vm_flags & VM_WRITE)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > > > +     if (vma->vm_flags & VM_EXEC)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > > > +     if (vma->vm_flags & VM_MAYSHARE)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > > > +
> > >
> > > [...]
> > >
> > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > index 45e4e64fd664..fe8924a8d916 100644
> > > > > --- a/include/uapi/linux/fs.h
> > > > > +++ b/include/uapi/linux/fs.h
> > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > > >       __u64 return_mask;
> > > > >  };
> > > > >
> > > > > +/* /proc/<pid>/maps ioctl */
> > > > > +#define PROCFS_IOCTL_MAGIC 0x9f
> > > >
> > > > Don't you need to document this in the proper place?
> > >
> > > I probably do, but I'm asking for help in knowing where. procfs is not
> > > a typical area of kernel I'm working with, so any pointers are highly
> > > appreciated.
> > >
> > > >
> > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > > > +
> > > > > +enum procmap_query_flags {
> > > > > +     PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > > > +};
> > > > > +
> > > > > +enum procmap_vma_flags {
> > > > > +     PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > > > +     PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > > > +     PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > > > +     PROCFS_PROCMAP_VMA_SHARED = 0x08,
> > > >
> > > > Are these bits?  If so, please use the bit macro for it to make it
> > > > obvious.
> > > >
> > >
> > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> > > add any extra #includes to this UAPI header, but I can figure out the
> > > necessary dependency and do BIT(), I just didn't feel like BIT() adds
> > > much here, tbh.
> > >
> > > > > +};
> > > > > +
> > > > > +struct procfs_procmap_query {
> > > > > +     __u64 size;
> > > > > +     __u64 query_flags;              /* in */
> > > >
> > > > Does this map to the procmap_vma_flags enum?  if so, please say so.
> > >
> > > no, procmap_query_flags, and yes, I will
> > >
> > > >
> > > > > +     __u64 query_addr;               /* in */
> > > > > +     __u64 vma_start;                /* out */
> > > > > +     __u64 vma_end;                  /* out */
> > > > > +     __u64 vma_flags;                /* out */
> > > > > +     __u64 vma_offset;               /* out */
> > > > > +     __u64 inode;                    /* out */
> > > >
> > > > What is the inode for, you have an inode for the file already, why give
> > > > it another one?
> > >
> > > This is inode of vma's backing file, same as /proc/<pid>/maps' file
> > > column. What inode of file do I already have here? You mean of
> > > /proc/<pid>/maps itself? It's useless for the intended purposes.
> > >
> > > >
> > > > > +     __u32 dev_major;                /* out */
> > > > > +     __u32 dev_minor;                /* out */
> > > >
> > > > What is major/minor for?
> > >
> > > This is the same information as emitted by /proc/<pid>/maps,
> > > identifies superblock of vma's backing file. As I mentioned above, it
> > > can be used for caching per-file (i.e., per-ELF binary) information
> > > (for example).
> > >
> > > >
> > > > > +     __u32 vma_name_size;            /* in/out */
> > > > > +     __u32 build_id_size;            /* in/out */
> > > > > +     __u64 vma_name_addr;            /* in */
> > > > > +     __u64 build_id_addr;            /* in */
> > > >
> > > > Why not document this all using kerneldoc above the structure?
> > >
> > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be
> > > figuring out the best place and approach, and so wanted to avoid
> > > documentation churn.
> > >
> > > Would something like what we have for pm_scan_arg and pagemap APIs
> > > work? I see it added a few simple descriptions for pm_scan_arg struct,
> > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> > > though)? Anyways, I'm hoping for pointers where all this should be
> > > documented. Thank you!
> > >
> > > >
> > > > anyway, I don't like ioctls, but there is a place for them, you just
> > > > have to actually justify the use for them and not say "not efficient
> > > > enough" as that normally isn't an issue overall.
> > >
> > > I've written a demo tool in patch #5 which performs real-world task:
> > > mapping addresses to their VMAs (specifically calculating file offset,
> > > finding vma_start + vma_end range to further access files from
> > > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > > faithfully, doing it in the most optimal way for both APIs. I showed
> > > that for "typical" (it's hard to specify what typical is, of course,
> > > too many variables) scenario (it was data collected on a real server
> > > running real service, 30 seconds of process-specific stack traces were
> > > captured, if I remember correctly). I showed that doing exactly the
> > > same amount of work is ~35x times slower with /proc/<pid>/maps.
> > >
> > > Take another process, another set of addresses, another anything, and
> > > the numbers will be different, but I think it gives the right idea.
> > >
> > > But I think we are overpivoting on text vs binary distinction here.
> > > It's the more targeted querying of VMAs that's beneficial here. This
> > > allows applications to not cache anything and just re-query when doing
> > > periodic or continuous profiling (where addresses are coming in not as
> > > one batch, as a sequence of batches extended in time).
> > >
> > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > > of ability, as it wasn't designed to do that and is targeting
> > > different use cases.
> > >
> > > And then, a new ability to request reliable (it's not 100% reliable
> > > today, I'm going to address that as a follow up) build ID is *crucial*
> > > for some scenarios. The mentioned Oculus use case, the need to fully
> > > access underlying ELF binary just to get build ID is frowned upon. And
> > > for a good reason. Profiler only needs build ID, which is no secret
> > > and not sensitive information. This new (and binary, yes) API allows
> > > to add this into an API without breaking any backwards compatibility.
> > >
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> >