On Tue, Feb 13, 2024 at 05:45:45PM +0100, Christian Brauner wrote:
> Hey,
>
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do so far simply because of
> the very justified limitations of anonymous inodes. So yesterday I sat
> down and wrote it down.
>
> Back when I added pidfds the concept was new (on Linux) and the
> limitations were acceptable but now it's starting to hurt us. And with
> the concept of pidfds having been around quite a while and being widely
> used this is worth doing. This makes it so that:
>
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() for equality.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>   file->private_data. This means it is now possible to introduce
>   concepts that operate on a process once all file descriptors have been
>   closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode. Even
>   if we were to move to anon_inode_create_getfile() which creates new
>   inodes we'd still be associating the same struct pid with multiple
>   different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>   all security hooks are called unblocking proper LSM management for
>   pidfds. In addition fsnotify hooks are called and allow for listening
>   to open events on pidfds.
>
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no inode
> operations in general, so nothing complex. It's hopefully the best kind
> of dumb there is. Dentries and inodes are always deleted when the last
> pidfd is closed.
>
> I've made the new code optional and placed it under CONFIG_FS_PIDFD but
> I'm confident we can remove that very soon. This takes some inspiration
> from nsfs which uses a similar stashing mechanism.
>
> Thanks!
> Christian
>
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
>
> ---
> base-commit: 3f643cd2351099e6b859533b6f984463e5315e5f
> change-id: 20240212-vfs-pidfd_fs-9a6e49283d80

I forgot to mention that pidfds are explicitly not simply directory
inodes in procfs, for various reasons, so this isn't an option I want to
pursue. Integrating them into procfs would add a nasty level of
complexity and make for very ugly and convoluted code, especially in how
it would need to be integrated into copy_process() and other locations.

It also poses significant security and permission checking challenges to
userspace because it is generally not safe to send around file
descriptors for /proc/<pid> directories. It's a pretty big attack vector
and cause of security issues. So really this is not a path that I want
to go down. It defeats the whole purpose of pidfds as opaque, easily
delegatable handles.

Oh, and the tree is vfs.pidfd at the usual location:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
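
To make the "pidfds can be compared simply via statx()" point concrete,
here is a rough userspace sketch. It is not part of the series. The
helper names (sys_pidfd_open(), statx_same_inode(),
pidfd_same_process()) are made up for illustration, and it assumes a
glibc new enough to provide the statx() wrapper. The comparison is only
meaningful on a kernel with this series applied: with the anonymous
inode backing every pidfd shares one inode, so everything would compare
as equal.

```c
#define _GNU_SOURCE
#include <fcntl.h>       /* AT_EMPTY_PATH */
#include <stdbool.h>
#include <sys/stat.h>    /* statx(), struct statx, STATX_INO */
#include <sys/syscall.h> /* SYS_pidfd_open */
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
/* Assumption: generic syscall number, as used on x86-64 and arm64. */
#define SYS_pidfd_open 434
#endif

/* Hypothetical wrapper; older glibc versions ship no pidfd_open(). */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Two statx results name the same inode if device and inode number match. */
static bool statx_same_inode(const struct statx *a, const struct statx *b)
{
	return a->stx_dev_major == b->stx_dev_major &&
	       a->stx_dev_minor == b->stx_dev_minor &&
	       a->stx_ino == b->stx_ino;
}

/*
 * Returns 1 if both pidfds resolve to the same inode, 0 if they do not,
 * and -1 if statx() fails or does not report an inode number.
 */
static int pidfd_same_process(int pidfd_a, int pidfd_b)
{
	struct statx sa, sb;

	if (statx(pidfd_a, "", AT_EMPTY_PATH, STATX_INO, &sa) < 0 ||
	    statx(pidfd_b, "", AT_EMPTY_PATH, STATX_INO, &sb) < 0)
		return -1;
	if (!(sa.stx_mask & STATX_INO) || !(sb.stx_mask & STATX_INO))
		return -1;

	return statx_same_inode(&sa, &sb) ? 1 : 0;
}
```

A caller would open two pidfds with sys_pidfd_open(pid, 0) and pass them
to pidfd_same_process(). Since the same struct pid always maps to the
same inode, two pidfds for one process compare equal, and pidfds for
different processes do not.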