On Tue, May 7, 2024 at 11:10 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote: > > * Andrii Nakryiko <andrii@xxxxxxxxxx> [240503 20:30]: > > /proc/<pid>/maps file is extremely useful in practice for various tasks > > involving figuring out process memory layout, what files are backing any > > given memory range, etc. One important class of applications that > > absolutely rely on this are profilers/stack symbolizers. They would > > normally capture stack trace containing absolute memory addresses of > > some functions, and would then use /proc/<pid>/maps file to file > > corresponding backing ELF files, file offsets within them, and then > > continue from there to get yet more information (ELF symbols, DWARF > > information) to get human-readable symbolic information. > > > > As such, there are both performance and correctness requirement > > involved. This address to VMA information translation has to be done as > > efficiently as possible, but also not miss any VMA (especially in the > > case of loading/unloading shared libraries). > > > > Unfortunately, for all the /proc/<pid>/maps file universality and > > usefulness, it doesn't fit the above 100%. > > > > First, it's text based, which makes its programmatic use from > > applications and libraries unnecessarily cumbersome and slow due to the > > need to do text parsing to get necessary pieces of information. > > > > Second, it's main purpose is to emit all VMAs sequentially, but in > > practice captured addresses would fall only into a small subset of all > > process' VMAs, mainly containing executable text. Yet, library would > > need to parse most or all of the contents to find needed VMAs, as there > > is no way to skip VMAs that are of no use. Efficient library can do the > > linear pass and it is still relatively efficient, but it's definitely an > > overhead that can be avoided, if there was a way to do more targeted > > querying of the relevant VMA information. > > > > Another problem when writing generic stack trace symbolization library > > is an unfortunate performance-vs-correctness tradeoff that needs to be > > made. Library has to make a decision to either cache parsed contents of > > /proc/<pid>/maps for service future requests (if application requests to > > symbolize another set of addresses, captured at some later time, which > > is typical for periodic/continuous profiling cases) to avoid higher > > costs of needed to re-parse this file or caching the contents in memory > > to speed up future requests. In the former case, more memory is used for > > the cache and there is a risk of getting stale data if application > > loaded/unloaded shared libraries, or otherwise changed its set of VMAs > > through additiona mmap() calls (and other means of altering memory > > address space). In the latter case, it's the performance hit that comes > > from re-opening the file and re-reading/re-parsing its contents all over > > again. > > > > This patch aims to solve this problem by providing a new API built on > > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary > > interface, avoiding the cost and awkwardness of textual representation > > for programmatic use. It's designed to be extensible and > > forward/backward compatible by including user-specified field size and > > using copy_struct_from_user() approach. But, most importantly, it allows > > to do point queries for specific single address, specified by user. And > > this is done efficiently using VMA iterator. > > > > User has a choice to pick either getting VMA that covers provided > > address or -ENOENT if none is found (exact, least surprising, case). Or, > > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can > > get either VMA that covers the address (if there is one), or the closest > > next VMA (i.e., VMA with the smallest vm_start > addr). The later allows > > more efficient use, but, given it could be a surprising behavior, > > requires an explicit opt-in. > > > > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes > > sense given it's querying the same set of VMA data. All the permissions > > checks performed on /proc/<pid>/maps opening fit here as well. > > ioctl-based implementation is fetching remembered mm_struct reference, > > but otherwise doesn't interfere with seq_file-based implementation of > > /proc/<pid>/maps textual interface, and so could be used together or > > independently without paying any price for that. > > > > There is one extra thing that /proc/<pid>/maps doesn't currently > > provide, and that's an ability to fetch ELF build ID, if present. User > > has control over whether this piece of information is requested or not > > by either setting build_id_size field to zero or non-zero maximum buffer > > size they provided through build_id_addr field (which encodes user > > pointer as __u64 field). > > > > The need to get ELF build ID reliably is an important aspect when > > dealing with profiling and stack trace symbolization, and > > /proc/<pid>/maps textual representation doesn't help with this, > > requiring applications to open underlying ELF binary through > > /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra > > permissions implications due giving a full access to the binary from > > (potentially) another process, while all application is interested in is > > build ID. Giving an ability to request just build ID doesn't introduce > > any additional security concerns, on top of what /proc/<pid>/maps is > > already concerned with, simplifying the overall logic. > > > > Kernel already implements build ID fetching, which is used from BPF > > subsystem. We are reusing this code here, but plan a follow up changes > > to make it work better under more relaxed assumption (compared to what > > existing code assumes) of being called from user process context, in > > which page faults are allowed. BPF-specific implementation currently > > bails out if necessary part of ELF file is not paged in, all due to > > extra BPF-specific restrictions (like the need to fetch build ID in > > restrictive contexts such as NMI handler). > > > > Note also, that fetching VMA name (e.g., backing file path, or special > > hard-coded or user-provided names) is optional just like build ID. If > > user sets vma_name_size to zero, kernel code won't attempt to retrieve > > it, saving resources. > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/fs.h | 32 ++++++++ > > 2 files changed, 197 insertions(+) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 8e503a1635b7..cb7b1ff1a144 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -22,6 +22,7 @@ > > #include <linux/pkeys.h> > > #include <linux/minmax.h> > > #include <linux/overflow.h> > > +#include <linux/buildid.h> > > > > #include <asm/elf.h> > > #include <asm/tlb.h> > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > return do_maps_open(inode, file, &proc_pid_maps_op); > > } > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) > > +{ > > + struct procfs_procmap_query karg; > > + struct vma_iterator iter; > > + struct vm_area_struct *vma; > > + struct mm_struct *mm; > > + const char *name = NULL; > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; > > + __u64 usize; > > + int err; > > + > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) > > + return -EFAULT; > > + if (usize > PAGE_SIZE) > > + return -E2BIG; > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr)) > > + return -EINVAL; > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); > > + if (err) > > + return err; > > + > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) > > + return -EINVAL; > > + if (!!karg.vma_name_size != !!karg.vma_name_addr) > > + return -EINVAL; > > + if (!!karg.build_id_size != !!karg.build_id_addr) > > + return -EINVAL; > > + > > + mm = priv->mm; > > + if (!mm || !mmget_not_zero(mm)) > > + return -ESRCH; > > + if (mmap_read_lock_killable(mm)) { > > + mmput(mm); > > + return -EINTR; > > + } > > Using the rcu lookup here will allow for more success rate with less > lock contention. > If you have any code pointers, I'd appreciate it. If not, I'll try to find it myself, no worries. > > + > > + vma_iter_init(&iter, mm, karg.query_addr); > > + vma = vma_next(&iter); > > + if (!vma) { > > + err = -ENOENT; > > + goto out; > > + } > > + /* user wants covering VMA, not the closest next one */ > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) && > > + vma->vm_start > karg.query_addr) { > > + err = -ENOENT; > > + goto out; > > + } > > The interface you are using is a start address to search from to the end > of the address space, so this won't work as you intended with the > PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator Maybe the name isn't the best, by "EXACT" here I meant "VMA that exactly covers provided address", so maybe "COVERING_OR_NEXT_VMA" would be better wording. With that out of the way, I think this API works exactly how I expect it to work: # cat /proc/3406/maps | grep -C1 7f42099fe000 7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0 7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 # cat addrs.txt 0x7f42099fe010 # ./procfs_query -f addrs.txt -p 3406 -v -Q PID: 3406 PATH: addrs.txt READ 1 addrs! SORTED ADDRS (1): ADDR #0: 0x7f42099fe010 VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 (build ID: NO, 0 bytes) RESOLVED ADDRS (1): RESOLVED #0: 0x7f42099fe010 -> OFF 0x2010 NAME /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8 You can see above that for the requested 0x7f42099fe010 address we got a VMA that starts before this address: 7f42099fe000-7f4209a0e000, which is what we want. Before submitting I ran the tool with /proc/<pid>/maps and ioctl to "resolve" the exact same set of addresses and I compared results. They were identical. Note, there is a small bug in the tool I added in patch #5. I changed `-i` argument to `-Q` at the very last moment and haven't updated the code in one place. But other than that I didn't change anything. For the above output, I added "VMA FOUND" verbose logging to see all the details of VMA, not just resolved offset. I'll add that in v2. > has the desired interface you want as the single address lookup doesn't > use the vma iterator. I'd just run the vma_next() and check the limits. > See find_exact_vma() for the limit checks. > > > + > > + karg.vma_start = vma->vm_start; > > + karg.vma_end = vma->vm_end; > > + [...]