On Tue, Jun 04, 2024 at 05:24:48PM -0700, Andrii Nakryiko wrote:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers (perf tool being one
> of them). Patterns of use differ, but they generally would fall into two
> categories.
>
> In on-demand pattern, a profiler/symbolizer would normally capture stack
> trace containing absolute memory addresses of some functions, and would
> then use /proc/<pid>/maps file to find corresponding backing ELF files
> (normally, only executable VMAs are of interest), file offsets within
> them, and then continue from there to get yet more information (ELF
> symbols, DWARF information) to get human-readable symbolic information.
> This pattern is used by Meta's fleet-wide profiler, as one example.
>
> In preprocessing pattern, application doesn't know the set of addresses
> of interest, so it has to fetch all relevant VMAs (again, probably only
> executable ones), store or cache them, then proceed with profiling and
> stack trace capture. Once done, it would do symbolization based on
> stored VMA information. This can happen at much later point in time.
> This patterns is used by perf tool, as an example.
>
> In either case, there are both performance and correctness requirement
> involved. This address to VMA information translation has to be done as
> efficiently as possible, but also not miss any VMA (especially in the
> case of loading/unloading shared libraries). In practice, correctness
> can't be guaranteed (due to process dying before VMA data can be
> captured, or shared library being unloaded, etc), but any effort to
> maximize the chance of finding the VMA is appreciated.
>
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above use cases 100%.
>
> First, it's main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a smaller subset of all
> process' VMAs, mainly containing executable text. Yet, library would
> need to parse most or all of the contents to find needed VMAs, as there
> is no way to skip VMAs that are of no use. Efficient library can do the
> linear pass and it is still relatively efficient, but it's definitely an
> overhead that can be avoided, if there was a way to do more targeted
> querying of the relevant VMA information.
>
> Second, it's a text based interface, which makes its programmatic use from
> applications and libraries more cumbersome and inefficient due to the
> need to handle text parsing to get necessary pieces of information. The
> overhead is actually payed both by kernel, formatting originally binary
> VMA data into text, and then by user space application, parsing it back
> into binary data for further use.

I was trying to solve all these issues in a more generic way:
https://lwn.net/Articles/683371/

We are definitely interested in using this new interface in CRIU.

<snip>

> +
> +	if (karg.vma_name_size) {
> +		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> +		const struct path *path;
> +		const char *name_fmt;
> +		size_t name_sz = 0;
> +
> +		get_vma_name(vma, &path, &name, &name_fmt);
> +
> +		if (path || name_fmt || name) {
> +			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> +			if (!name_buf) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +		if (path) {
> +			name = d_path(path, name_buf, name_buf_sz);
> +			if (IS_ERR(name)) {
> +				err = PTR_ERR(name);
> +				goto out;

This always fails if the file path name is longer than PATH_MAX.

Can we add a flag to indicate whether file names need to be resolved? In
criu, we use special names like "vvar", "vdso", but we dump files via
/proc/pid/map_files.
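
To make the request concrete, here is a rough userspace sketch of the
behaviour I have in mind. It is not kernel code and not a proposed patch:
QUERY_SKIP_FILE_NAMES, fill_from_end() and query_name() are names made up
for illustration only. fill_from_end() merely imitates the d_path()
convention that the quoted code relies on (the path lands at the end of the
buffer and a pointer to its first byte is returned).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: NOT part of the posted patch. */
#define QUERY_SKIP_FILE_NAMES 0x1u

/*
 * Userspace stand-in for d_path(): copy the path to the END of buf and
 * return a pointer to its first byte, or NULL if it does not fit.
 */
static char *fill_from_end(const char *path, char *buf, size_t buf_sz)
{
	size_t len = strlen(path) + 1;	/* include the trailing NUL */

	if (len > buf_sz)
		return NULL;
	return memcpy(buf + buf_sz - len, path, len);
}

/*
 * Returns the name size including the NUL on success, 0 if there is no
 * name or it was skipped on request, or a negative errno.
 */
static long query_name(const char *file_path, const char *special_name,
		       unsigned int flags, char *buf, size_t buf_sz,
		       const char **out)
{
	*out = NULL;

	if (file_path) {
		char *name;

		/* Caller does not want file paths: never touch the path. */
		if (flags & QUERY_SKIP_FILE_NAMES)
			return 0;

		name = fill_from_end(file_path, buf, buf_sz);
		if (!name)
			return -ENAMETOOLONG;
		*out = name;
		/* Same arithmetic as name_buf + name_buf_sz - name. */
		return (long)(buf + buf_sz - name);
	}

	if (special_name) {
		size_t len = strlen(special_name) + 1;

		if (len > buf_sz)
			return -ENAMETOOLONG;
		memcpy(buf, special_name, len);
		*out = buf;
		return (long)len;
	}

	return 0;
}
```

With such a flag set, a file-backed VMA with an overlong path would no
longer make the whole query fail, while special names would still be
reported as before.
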
> +			}
> +			name_sz = name_buf + name_buf_sz - name;
> +		} else if (name || name_fmt) {
> +			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> +			name = name_buf;
> +		}
> +		if (name_sz > name_buf_sz) {
> +			err = -ENAMETOOLONG;
> +			goto out;
> +		}
> +		karg.vma_name_size = name_sz;
> +	}

Thanks,
Andrei