On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > the user sets vma_name_size to zero, kernel code won't attempt to
> > > > > retrieve it, saving resources.
> > > > >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > >
> > > > Where is the userspace code that uses this new api you have created?
> > >
> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in the mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.

> I think the general use case in perf is different. This ioctl API is great
> for live tracing of a single (or a small number of) process(es). And
> yes, perf tools have those tracing use cases too. But I think the
> major use case of perf tools is system-wide profiling.
>
> For system-wide profiling, you need to process samples of many
> different processes at a high frequency. Now perf record doesn't
> process them and just saves them for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).

Since:

  Author: Jiri Olsa <jolsa@xxxxxxxxxx>
  Date:   Mon Dec 14 11:54:49 2020 +0100

    1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")

We don't need to process the events to find the build ids. I haven't
checked if we still do it to find out which DSOs had hits, but we
shouldn't need to do it for build-ids (unless they were not in memory
when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
haven't checked but IIRC is a possibility if that ELF part isn't in
memory at the time we want to copy it).

If we're still traversing it like that, I guess we can have a knob and
make it the default to not do that and instead create the perf.data
build ID header table with all the build-ids we got from
PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
processing at the end of a 'perf record' session.

> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling. And we cannot move processing

It comes in the PERF_RECORD_MMAP2, filled by the kernel.

> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

right

> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

right

> So it saves the memory map of processes and symbolizes
> the stack trace with it later. Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.
>
> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.
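
Just to make it concrete how the new interface could cover that initial
scan: something like the sketch below should be able to enumerate all
pre-existing VMAs by walking with PROCFS_PROCMAP_EXACT_OR_NEXT_VMA. This
is only written against the uapi proposed in this series (struct
procfs_procmap_query and PROCFS_PROCMAP_QUERY as quoted further down),
not something I ran.

/*
 * Sketch only: enumerate a process's VMAs via the proposed
 * PROCFS_PROCMAP_QUERY ioctl; untested, uapi as in this series.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* proposed PROCFS_PROCMAP_QUERY bits */

static int dump_vmas(pid_t pid)
{
        struct procfs_procmap_query q;
        char path[64], name[4096];
        __u64 addr = 0;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;

        for (;;) {
                memset(&q, 0, sizeof(q));
                q.size = sizeof(q);
                /* covering VMA, or the closest one after the address */
                q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
                q.query_addr = addr;
                q.vma_name_addr = (__u64)(unsigned long)name;
                q.vma_name_size = sizeof(name);

                if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) < 0)
                        break;  /* -ENOENT once we run past the last VMA */

                printf("%llx-%llx %llx %u:%u %llu %s\n",
                       q.vma_start, q.vma_end, q.vma_offset,
                       q.dev_major, q.dev_minor, q.inode,
                       q.vma_name_size ? name : "");

                addr = q.vma_end;       /* continue right after this VMA */
        }

        close(fd);
        return 0;
}

That is essentially one pass over /proc/PID/maps in binary form, with
the option of asking for the build ID in the same call.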
> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yes, I wonder how far Jiri went, Jiri?

- Arnaldo

> Thanks,
> Namhyung
>
> > At some point, when BPF iterators became a thing we thought about, IIRC
> > Jiri did some experimentation, but I lost track, of using BPF to
> > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > as in uapi/linux/perf_event.h:
> >
> > /*
> >  * The MMAP2 records are an augmented version of MMAP, they add
> >  * maj, min, ino numbers to be used to uniquely identify each mapping
> >  *
> >  * struct {
> >  *      struct perf_event_header header;
> >  *
> >  *      u32                      pid, tid;
> >  *      u64                      addr;
> >  *      u64                      len;
> >  *      u64                      pgoff;
> >  *      union {
> >  *              struct {
> >  *                      u32      maj;
> >  *                      u32      min;
> >  *                      u64      ino;
> >  *                      u64      ino_generation;
> >  *              };
> >  *              struct {
> >  *                      u8       build_id_size;
> >  *                      u8       __reserved_1;
> >  *                      u16      __reserved_2;
> >  *                      u8       build_id[20];
> >  *              };
> >  *      };
> >  *      u32                      prot, flags;
> >  *      char                     filename[];
> >  *      struct sample_id         sample_id;
> >  * };
> >  */
> >     PERF_RECORD_MMAP2                   = 10,
> >
> >  * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
> > setup a meta data perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> >  Usage: perf record [<options>] [<command>]
> >     or: perf record [<options>] -- <command> [<options>]
> >
> >         --num-thread-synthesize <n>
> >                           number of threads to run for event synthesis
> >         --synth <no|all|task|mmap|cgroup>
> >                           Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
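
Circling back to the BPF iterator route: the kernel has had an
iter/task_vma BPF iterator for a while, so the pre-existing-maps pass
could start from something like the rough sketch below (assuming
vmlinux.h and libbpf). It only dumps VMAs through the iterator's seq
file; actually injecting PERF_RECORD_MMAP2 records into the perf ring
buffer, honoring perf_event_attr, is the part that still needs plumbing.

/*
 * Rough sketch of a task_vma BPF iterator (the kind of thing Jiri was
 * experimenting with); assumes vmlinux.h and libbpf.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/task_vma")
int dump_task_vmas(struct bpf_iter__task_vma *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct vm_area_struct *vma = ctx->vma;

        if (!task || !vma)
                return 0;

        /* one line per pre-existing mapping: pid, start, end, pgoff */
        BPF_SEQ_PRINTF(seq, "%d %lx %lx %lx\n", task->tgid,
                       vma->vm_start, vma->vm_end, vma->vm_pgoff);
        return 0;
}

Userspace would attach it with bpf_program__attach_iter() and read the
resulting seq file, or, for the perf case, the program would instead
build the records and push them where perf expects them.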
> > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
> > > > >
> > > > > ---
> > > > >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/fs.h |  32 ++++++++
> > > > >  2 files changed, 197 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -22,6 +22,7 @@
> > > > >  #include <linux/pkeys.h>
> > > > >  #include <linux/minmax.h>
> > > > >  #include <linux/overflow.h>
> > > > > +#include <linux/buildid.h>
> > > > >
> > > > >  #include <asm/elf.h>
> > > > >  #include <asm/tlb.h>
> > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > >          return do_maps_open(inode, file, &proc_pid_maps_op);
> > > > >  }
> > > > >
> > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > > +{
> > > > > +        struct procfs_procmap_query karg;
> > > > > +        struct vma_iterator iter;
> > > > > +        struct vm_area_struct *vma;
> > > > > +        struct mm_struct *mm;
> > > > > +        const char *name = NULL;
> > > > > +        char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > > +        __u64 usize;
> > > > > +        int err;
> > > > > +
> > > > > +        if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > > +                return -EFAULT;
> > > > > +        if (usize > PAGE_SIZE)
> > > >
> > > > Nice, where did you document that?  And how is that portable given that
> > > > PAGE_SIZE can be different on different systems?
> > >
> > > I'm happy to document everything, can you please help by pointing out
> > > where this documentation has to live?
> > >
> > > This is mostly fool-proofing, though, because the user has to pass
> > > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > > close to even 4KB (not even saying about 64KB). This is just to
> > > prevent copy_struct_from_user() below from doing too much
> > > zero-checking.
> > >
> > > > and why aren't you checking the actual structure size instead?  You can
> > > > easily run off the end here without knowing it.
> > >
> > > See copy_struct_from_user(), it does more checks. This is a helper
> > > designed specifically to deal with use cases like this where the kernel
> > > struct size can change and user space might be newer or older.
> > > copy_struct_from_user() has nice documentation describing all these
> > > nuances.
> > >
> > > > > +                return -E2BIG;
> > > > > +        if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > > +                return -EINVAL;
> > > >
> > > > Ok, so you have two checks?  How can the first one ever fail?
> > >
> > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > > won't fail, but this one will fail.
> > >
> > > The point of this check is that the user has to specify at least the
> > > first three fields of procfs_procmap_query (size, query_flags, and
> > > query_addr), because without those the query is meaningless.
> > >
> > > > > +        err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > >
> > > and this helper does more checks, validating that the user either has a
> > > shorter struct (and then zero-fills the rest of the kernel-side struct) or
> > > has a longer one (and then the longer part has to be zero filled). Do check
> > > copy_struct_from_user() documentation, it's great.
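
For reference, the helper Andrii is pointing at boils down to this
(condensed from include/linux/uaccess.h, object-size WARN and kerneldoc
dropped), which is why the usize checks above compose the way he
describes:

static int copy_struct_from_user(void *dst, size_t ksize,
                                 const void __user *src, size_t usize)
{
        size_t size = min(ksize, usize);
        size_t rest = max(ksize, usize) - size;

        if (usize < ksize) {
                /* older userspace: zero-fill the tail of the kernel struct */
                memset(dst + size, 0, rest);
        } else if (usize > ksize) {
                /* newer userspace: the bytes the kernel doesn't know about
                 * must all be zero, otherwise -E2BIG */
                int ret = check_zeroed_user(src + size, rest);

                if (ret <= 0)
                        return ret ?: -E2BIG;
        }
        /* copy the part both sides know about */
        return copy_from_user(dst, src, size) ? -EFAULT : 0;
}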
> > > > >
> > > > > +        if (err)
> > > > > +                return err;
> > > > > +
> > > > > +        if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > > +                return -EINVAL;
> > > > > +        if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > > +                return -EINVAL;
> > > > > +        if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > > +                return -EINVAL;
> > > >
> > > > So you want values to be set, right?
> > >
> > > Either both should be set, or neither. It's ok for both size/addr
> > > fields to be zero, in which case it indicates that the user doesn't
> > > want this part of the information (which is usually a bit more expensive
> > > to get and might not be necessary for all the cases).
> > >
> > > > > +
> > > > > +        mm = priv->mm;
> > > > > +        if (!mm || !mmget_not_zero(mm))
> > > > > +                return -ESRCH;
> > > >
> > > > What is this error for?  Where is this documented?
> > >
> > > I copied it from the existing /proc/<pid>/maps checks. I presume it's
> > > guarding the case when mm might be already put. So if the process is
> > > gone, but we have /proc/<pid>/maps file open?
> > >
> > > > > +        if (mmap_read_lock_killable(mm)) {
> > > > > +                mmput(mm);
> > > > > +                return -EINTR;
> > > > > +        }
> > > > > +
> > > > > +        vma_iter_init(&iter, mm, karg.query_addr);
> > > > > +        vma = vma_next(&iter);
> > > > > +        if (!vma) {
> > > > > +                err = -ENOENT;
> > > > > +                goto out;
> > > > > +        }
> > > > > +        /* user wants covering VMA, not the closest next one */
> > > > > +        if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > > +            vma->vm_start > karg.query_addr) {
> > > > > +                err = -ENOENT;
> > > > > +                goto out;
> > > > > +        }
> > > > > +
> > > > > +        karg.vma_start = vma->vm_start;
> > > > > +        karg.vma_end = vma->vm_end;
> > > > > +
> > > > > +        if (vma->vm_file) {
> > > > > +                const struct inode *inode = file_user_inode(vma->vm_file);
> > > > > +
> > > > > +                karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > > +                karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > > +                karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > > >
> > > > So the major/minor is that of the file superblock?  Why?
> > >
> > > Because the inode number is unique only within a given super block (and
> > > even then it's more complicated, e.g., btrfs subvolumes add more
> > > headaches, I believe). inode + dev maj/min is sometimes used for
> > > cache/reuse of per-binary information (e.g., pre-processed DWARF
> > > information, which is *very* expensive, so anything that allows us to
> > > avoid redoing it is helpful).
> > >
> > > > > +                karg.inode = inode->i_ino;
> > > >
> > > > What is userspace going to do with this?
> > >
> > > See above.
> > >
> > > > > +        } else {
> > > > > +                karg.vma_offset = 0;
> > > > > +                karg.dev_major = 0;
> > > > > +                karg.dev_minor = 0;
> > > > > +                karg.inode = 0;
> > > >
> > > > Why not set everything to 0 up above at the beginning so you never miss
> > > > anything, and you don't miss any holes accidentally in the future.
> > >
> > > Stylistic preference, I find this more explicit, but I don't care much
> > > one way or another.
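
Going back to the PROCFS_PROCMAP_EXACT_OR_NEXT_VMA check a bit further
up, a worked example of what the flag changes (addresses made up):

/*
 * Suppose the process has two VMAs: [0x1000, 0x2000) and [0x5000, 0x6000).
 *
 *   query_addr = 0x1800, query_flags = 0
 *     -> [0x1000, 0x2000), the VMA covering the address
 *   query_addr = 0x3000, query_flags = 0
 *     -> -ENOENT, no VMA covers the address
 *   query_addr = 0x3000, query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA
 *     -> [0x5000, 0x6000), the closest VMA at or after the address
 */

So the same call covers both the "symbolize this exact address" and the
"walk all VMAs" usage patterns.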
> > > > > > > > + } > > > > > + > > > > > + karg.vma_flags = 0; > > > > > + if (vma->vm_flags & VM_READ) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE; > > > > > + if (vma->vm_flags & VM_WRITE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE; > > > > > + if (vma->vm_flags & VM_EXEC) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE; > > > > > + if (vma->vm_flags & VM_MAYSHARE) > > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED; > > > > > + > > > > > > [...] > > > > > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > > > > index 45e4e64fd664..fe8924a8d916 100644 > > > > > --- a/include/uapi/linux/fs.h > > > > > +++ b/include/uapi/linux/fs.h > > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg { > > > > > __u64 return_mask; > > > > > }; > > > > > > > > > > +/* /proc/<pid>/maps ioctl */ > > > > > +#define PROCFS_IOCTL_MAGIC 0x9f > > > > > > > > Don't you need to document this in the proper place? > > > > > > I probably do, but I'm asking for help in knowing where. procfs is not > > > a typical area of kernel I'm working with, so any pointers are highly > > > appreciated. > > > > > > > > > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query) > > > > > + > > > > > +enum procmap_query_flags { > > > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01, > > > > > +}; > > > > > + > > > > > +enum procmap_vma_flags { > > > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01, > > > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02, > > > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04, > > > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08, > > > > > > > > Are these bits? If so, please use the bit macro for it to make it > > > > obvious. > > > > > > > > > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to > > > add any extra #includes to this UAPI header, but I can figure out the > > > necessary dependency and do BIT(), I just didn't feel like BIT() adds > > > much here, tbh. > > > > > > > > +}; > > > > > + > > > > > +struct procfs_procmap_query { > > > > > + __u64 size; > > > > > + __u64 query_flags; /* in */ > > > > > > > > Does this map to the procmap_vma_flags enum? if so, please say so. > > > > > > no, procmap_query_flags, and yes, I will > > > > > > > > > > > > + __u64 query_addr; /* in */ > > > > > + __u64 vma_start; /* out */ > > > > > + __u64 vma_end; /* out */ > > > > > + __u64 vma_flags; /* out */ > > > > > + __u64 vma_offset; /* out */ > > > > > + __u64 inode; /* out */ > > > > > > > > What is the inode for, you have an inode for the file already, why give > > > > it another one? > > > > > > This is inode of vma's backing file, same as /proc/<pid>/maps' file > > > column. What inode of file do I already have here? You mean of > > > /proc/<pid>/maps itself? It's useless for the intended purposes. > > > > > > > > > > > > + __u32 dev_major; /* out */ > > > > > + __u32 dev_minor; /* out */ > > > > > > > > What is major/minor for? > > > > > > This is the same information as emitted by /proc/<pid>/maps, > > > identifies superblock of vma's backing file. As I mentioned above, it > > > can be used for caching per-file (i.e., per-ELF binary) information > > > (for example). > > > > > > > > > > > > + __u32 vma_name_size; /* in/out */ > > > > > + __u32 build_id_size; /* in/out */ > > > > > + __u64 vma_name_addr; /* in */ > > > > > + __u64 build_id_addr; /* in */ > > > > > > > > Why not document this all using kerneldoc above the structure? > > > > > > Yes, sorry, I slacked a bit on adding this upfront. 
> > > I knew we'd be figuring out the best place and approach, and so wanted
> > > to avoid documentation churn.
> > >
> > > Would something like what we have for pm_scan_arg and pagemap APIs
> > > work? I see it added a few simple descriptions for the pm_scan_arg
> > > struct, and there is Documentation/admin-guide/mm/pagemap.rst. Should
> > > I add Documentation/admin-guide/mm/procmap.rst (admin-guide part feels
> > > off, though)? Anyways, I'm hoping for pointers where all this should be
> > > documented. Thank you!
> > >
> > > > anyway, I don't like ioctls, but there is a place for them, you just
> > > > have to actually justify the use for them and not say "not efficient
> > > > enough" as that normally isn't an issue overall.
> > >
> > > I've written a demo tool in patch #5 which performs a real-world task:
> > > mapping addresses to their VMAs (specifically calculating file offset,
> > > finding the vma_start + vma_end range to further access files from
> > > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > > faithfully, doing it in the most optimal way for both APIs. For a
> > > "typical" scenario (it's hard to specify what typical is, of course,
> > > too many variables; it was data collected on a real server running a
> > > real service, 30 seconds of process-specific stack traces were
> > > captured, if I remember correctly), I showed that doing exactly the
> > > same amount of work is ~35x slower with /proc/<pid>/maps.
> > >
> > > Take another process, another set of addresses, another anything, and
> > > the numbers will be different, but I think it gives the right idea.
> > >
> > > But I think we are overpivoting on the text vs binary distinction here.
> > > It's the more targeted querying of VMAs that's beneficial here. This
> > > allows applications to not cache anything and just re-query when doing
> > > periodic or continuous profiling (where addresses are coming in not as
> > > one batch, but as a sequence of batches extended in time).
> > >
> > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > > of ability, as it wasn't designed to do that and is targeting
> > > different use cases.
> > >
> > > And then, a new ability to request a reliable (it's not 100% reliable
> > > today, I'm going to address that as a follow up) build ID is *crucial*
> > > for some scenarios. In the mentioned Oculus use case, the need to fully
> > > access the underlying ELF binary just to get the build ID is frowned
> > > upon. And for a good reason. The profiler only needs the build ID, which
> > > is no secret and not sensitive information. This new (and binary, yes)
> > > API allows adding this without breaking any backwards compatibility.
> > >
> > > > thanks,
> > > >
> > > > greg k-h
> >
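
On the build ID point, this is roughly what the consumer side would look
like with the new interface. Again just a sketch against the proposed
uapi, untested; it assumes build_id_size is written back with the actual
length, as the in/out marking suggests, and that BUILD_ID_SIZE_MAX stays
at 20:

/* Sketch: fetch the build ID for the VMA covering 'addr'; untested. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* proposed PROCFS_PROCMAP_QUERY bits */

static int print_build_id(int maps_fd, __u64 addr)
{
        struct procfs_procmap_query q;
        unsigned char build_id[20];     /* BUILD_ID_SIZE_MAX today */
        __u32 i;

        memset(&q, 0, sizeof(q));
        q.size = sizeof(q);
        q.query_addr = addr;            /* no flags: covering VMA only */
        q.build_id_addr = (__u64)(unsigned long)build_id;
        q.build_id_size = sizeof(build_id);

        if (ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) < 0)
                return -1;

        for (i = 0; i < q.build_id_size; i++)
                printf("%02x", build_id[i]);
        printf(" %llx-%llx +%llx\n", q.vma_start, q.vma_end,
               q.vma_offset + (addr - q.vma_start));
        return 0;
}

No need to open the ELF binary at all, which is the point for the
locked-down Oculus-style setups.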