Re: [PATCH 2/2] pidfd: add pidfdfs

Christian Brauner <brauner@xxxxxxxxxx> · Sun, 18 Feb 2024 12:15:02 +0100

On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> >
> > But I have a really stupid (I know nothing about vfs) question, why do we
> > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> >
> > IIUC, if this pid is freed and then another "struct pid" has the same address
> > we can rely on __wait_on_freeing_inode() ?
> 
> Heh. Maybe it would work, but we really don't want to expose core
> kernel pointers to user space as the inode number.

And then also the property that the inode number is unique for the
system lifetime is extremely useful for userspace and I would like to
retain that property.

> 
> So then we'd have to add extra hackery to that (ie we'd have to
> intercept stat calls, and we'd have to have something else for
> ->d_dname() etc..).
> 
> Those are all things that the VFS does support, but ...
> 
> So I do prefer Christian's new approach, although some of it ends up
> being a bit unclear.
> 
> Christian, can you explain why this:
> 
>         spin_lock(&alias->d_lock);
>         dget_dlock(alias);
>         spin_unlock(&alias->d_lock);
> 
> instead of just 'dget()'?

No reason other than I forgot to switch to dget().

> 
> Also, while I found the old __ns_get_path() to be fairly disgusting, I
> actually think it's simpler and clearer than playing around with the
> dentry alias list. So my expectation on code sharing was that you'd

It's overall probably also cheaper, I think.

> basically lift the old __ns_get_path(), make *that* the helper, and
> just pass it an argument that is the pointer to the filesystem
> "stashed" entry...
> 
> And yes, using "atomic_long_t" for stashed is a crime against
> humanity. It's also entirely pointless. There are no actual atomic
> operations that the code wants except for reading and writing (aka
> READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> "atomic_long_t" buys the code nothing, and only makes things more
> complicated and requires crazy casts.

Yup, I had that as a draft and that introduced struct ino_stash which
contained a dentry pointer and the inode number using cmpxchg(). But I
decided against this because ns_common.h would require to have access to
ino_stash definition so we wouldn't just able to hide it in internal.h
where it should belong.

> 
> So I think the nsfs.c code should be changed to do
> 
> -       atomic_long_t stashed;
> +       struct dentry *stashed;
> 
> and remove the crazy workarounds for using the wrong type.
> 
> Something like the attached patch.
> 
> Then, I think the whole "lockref_get_not_dead()" etc games of
> __ns_get_path() could be extracted out into a helper function that
> takes that "&ns->stashed" pointer as an argument, and now that helper
> could also be used for pidfs, except pidfs obviously just does
> "&pid->stashed" instead.
> 
> Hmm?
> 
> Entirely untested patch. Also, the above idea may be broken because of
> some obvious issue that I didn't think about. Christian?

Yeah, sure. Works for me. Let me play with something.