On Sat, May 4, 2024 at 4:24 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote: > > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow > > applications to query VMA information more efficiently than through textual > > processing of /proc/<pid>/maps contents. See patch #2 for the context, > > justification, and nuances of the API design. > > > > Patch #1 is a refactoring to keep VMA name logic determination in one place. > > Patch #2 is the meat of kernel-side API. > > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include. > > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to > > optionally use this new ioctl()-based API, if supported. > > Patch #5 implements a simple C tool to demonstrate intended efficient use (for > > both textual and binary interfaces) and allows benchmarking them. Patch itself > > also has performance numbers of a test based on one of the medium-sized > > internal applications taken from production. > > I don't have anything against adding a binary interface for this. But > it's somewhat odd to do ioctls based on /proc files. I wonder if there > isn't a more suitable place for this. prctl()? New vmstat() system call > using a pidfd/pid as reference? ioctl() on fs/pidfs.c? I did ioctl() on /proc/<pid>/maps because that's the file that's used for the same use cases and it can be opened from other processes for any target PID. I'm open to any suggestions that make more sense, this v1 is mostly to start the conversation. prctl() probably doesn't make sense, as according to man page: prctl() manipulates various aspects of the behavior of the calling thread or process. And this facility is most often used from another (profiler or symbolizer) process. New syscall feels like an overkill, but if that's the only way, so be it. I do like the idea of ioctl() on top of pidfd (I assume that's what you mean by "fs/pidfs.c", right)? This seems most promising. One question/nuance. If I understand correctly, pidfd won't hold task_struct (and its mm_struct) reference, right? So if the process exits, even if I have pidfd, that task is gone and so we won't be able to query it. Is that right? If yes, then it's still workable in a lot of situations, but it would be nice to have an ability to query VMAs (at least for binary's own text segments) even if the process exits. This is the case for short-lived processes that profilers capture some stack traces from, but by the time these stack traces are processed they are gone. This might be a stupid idea and question, but what if ioctl() on pidfd itself would create another FD that would represent mm_struct of that process, and then we have ioctl() on *that* soft-of-mm-struct-fd to query VMA. Would that work at all? This approach would allow long-running profiler application to open pidfd and this other "mm fd" once, cache it, and then just query it. Meanwhile we can epoll() pidfd itself to know when the process exits so that these mm_structs are not referenced for longer than necessary. Is this pushing too far or you think that would work and be acceptable? But in any case, I think ioctl() on top of pidfd makes total sense for this, thanks.