On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote: > On Fri, Nov 01, 2024 at 01:54:48PM +0000, Erin Shepherd wrote: > > Since the introduction of pidfs, we have had 64-bit process identifiers > > that will not be reused for the entire uptime of the system. This greatly > > facilitates process tracking in userspace. > > > > There are two limitations at present: > > > > * These identifiers are currently only exposed to processes on 64-bit > > systems. On 32-bit systems, inode space is also limited to 32 bits and > > therefore is subject to the same reuse issues. We should really just move to storing 64-bit inode numbers internally on 32-bit machines. That would at least make statx() give you all 64 bits on 32-bit host. > > * There is no way to go from one of these unique identifiers to a pid or > > pidfd. > > > > Patch 1 & 2 in this stack implements fh_export for pidfs. This means > > userspace can retrieve a unique process identifier even on 32-bit systems > > via name_to_handle_at. > > > > Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means > > userspace can convert back from a file handle to the corresponding pidfd. > > To support us going from a file handle to a pidfd, we have to store a pid > > inside the file handle. To ensure file handles are invariant and can move > > between pid namespaces, we stash a pid from the initial namespace inside > > the file handle. > > > > I'm not quite sure if stashing an initial-namespace pid inside the file > > handle is the right approach here; if not, I think that patch 1 & 2 are > > useful on their own. Hmm... I guess pid namespaces don't have a convenient 64-bit ID like mount namespaces do? In that case, stashing the pid from init_ns is probably the next best thing. > > Sorry for the delayed reply (I'm recovering from a lengthy illness.). > > I like the idea in general. I think this is really useful. A few of my > thoughts but I need input from Amir and Jeff: > > * In the last patch of the series you already implement decoding of > pidfd file handles by adding a .fh_to_dentry export_operations method. > > There are a few things to consider because of how open_by_handle_at() > works. > > - open_by_handle_at() needs to be restricted so it only creates pidfds > from pidfs file handles that resolve to a struct pid that is > reachable in the caller's pid namespace. In other words, it should > mirror pidfd_open(). > > Put another way, open_by_handle_at() must not be usable to open > arbitrary pids to prevent a container from constructing a pidfd file > handle for a process that lives outside it's pid namespace > hierarchy. > > With this restriction in place open_by_handle_at() can be available > to let unprivileged processes open pidfd file handles. > > Related to that, I don't think we need to make open_by_handle_at() > open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply > because any process in the initial pid namespace can open any other > process via pidfd_open() anyway because pid namespaces are > hierarchical. > > IOW, CAP_DAC_READ_SEARCH must not override the restriction that the > provided pidfs file handle must be reachable from the caller's pid > namespace. > > - open_by_handle_at() uses may_decode_fh() to determine whether it's > possible to decode a file handle as an unprivileged user. The > current checks don't make sense for pidfs. Conceptually, I think > there don't need to place any restrictions based on global > CAP_DAC_READ_SEARCH, owning user namespace of the superblock or > mount on pidfs file handles. > > The only restriction that matters is that the requested pidfs file > handle is reachable from the caller's pid namespace. > > - A pidfd always has exactly a single inode and a single dentry. > There's no aliases. > > - Generally, in my naive opinion, I think that decoding pidfs file > handles should be a lot simpler than decoding regular path based > file handles. Because there should be no need to verify any > ancestors, or reconnect paths. Pidfs also doesn't have directory > inodes, only regular inodes. In other words, any dentry is > acceptable. > > Essentially, the only thing we need is for exportfs_decode_fh_raw() > to verify that the provided pidfs file handle is resolvable in the > caller's pid namespace. If so we're done. The challenge is how to > nicely plumb this into the code without it sticking out like a sore > thumb. > > - Pidfs should not be exportable via NFS. It doesn't make sense. I haven't looked over the patchset yet, but those restrictions all sound pretty reasonable to me. Special casing the may_decode_fh permission checks may be the tricky bit. I'm not sure what that should look like, tbqh. -- Jeff Layton <jlayton@xxxxxxxxxx>