Re: [PATCH 2/2] pidfd: add pidfdfs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Feb 18, 2024 at 12:33:41PM +0100, Christian Brauner wrote:
> On Sun, Feb 18, 2024 at 12:15:02PM +0100, Christian Brauner wrote:
> > On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> > > >
> > > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > > >
> > > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > > we can rely on __wait_on_freeing_inode() ?
> > > 
> > > Heh. Maybe it would work, but we really don't want to expose core
> > > kernel pointers to user space as the inode number.
> > 
> > And then also the property that the inode number is unique for the
> > system lifetime is extremely useful for userspace and I would like to
> > retain that property.
> > 
> > > 
> > > So then we'd have to add extra hackery to that (ie we'd have to
> > > intercept stat calls, and we'd have to have something else for
> > > ->d_dname() etc..).
> > > 
> > > Those are all things that the VFS does support, but ...
> > > 
> > > So I do prefer Christian's new approach, although some of it ends up
> > > being a bit unclear.
> > > 
> > > Christian, can you explain why this:
> > > 
> > >         spin_lock(&alias->d_lock);
> > >         dget_dlock(alias);
> > >         spin_unlock(&alias->d_lock);
> > > 
> > > instead of just 'dget()'?
> > 
> > No reason other than I forgot to switch to dget().
> > 
> > > 
> > > Also, while I found the old __ns_get_path() to be fairly disgusting, I
> > > actually think it's simpler and clearer than playing around with the
> > > dentry alias list. So my expectation on code sharing was that you'd
> > 
> > It's overall probably also cheaper, I think.
> > 
> > > basically lift the old __ns_get_path(), make *that* the helper, and
> > > just pass it an argument that is the pointer to the filesystem
> > > "stashed" entry...
> > > 
> > > And yes, using "atomic_long_t" for stashed is a crime against
> > > humanity. It's also entirely pointless. There are no actual atomic
> > > operations that the code wants except for reading and writing (aka
> > > READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> > > "atomic_long_t" buys the code nothing, and only makes things more
> > > complicated and requires crazy casts.
> > 
> > Yup, I had that as a draft and that introduced struct ino_stash which
> > contained a dentry pointer and the inode number using cmpxchg(). But I
> > decided against this because ns_common.h would require to have access to
> > ino_stash definition so we wouldn't just able to hide it in internal.h
> > where it should belong.
> 
> Right, I remember. The annoying thing will be how to cleanly handle this
> without having to pass too many parameters because we need d_fsdata, the
> vfsmount, and the inode->i_fop. So let me see if I can get this to
> something that doesn't look too ugly.

So, I'm running out of time today so I'm appending the draft I jotted
down now (untested).
Roughly, here's what I think would work for both nsfs and pidfs (I got
the hint and will rename from pidfdfs ;)). The thing that makes it a bit
tricky is that we need to indicate to the caller whether we've reused a
stashed or added a new inode/dentry so the caller can put the reference
to the object it took for i_private in case we reused a dentry/inode. On
EAGAIN i_private is property of the fs and will be cleaned up in
->evict(). Alternative is a callback for getting a reference which I
think is also ugly. Better suggestions welcome of course.
>From 281553f0059a889476dba5f4460570ea08ceefe5 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@xxxxxxxxxx>
Date: Sun, 18 Feb 2024 14:50:13 +0100
Subject: [PATCH 1/3] [RFC UNTESTED LIKELY STILL BROKEN] internal: add
 path_from_stashed()

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
 fs/internal.h |  3 ++
 fs/libfs.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..cfddaec6fbf6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,6 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, const struct file_operations *fops,
+		      void *data, struct path *path);
diff --git a/fs/libfs.c b/fs/libfs.c
index eec6031b0155..af46a83cd476 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1973,3 +1973,81 @@ struct timespec64 simple_inode_init_ts(struct inode *inode)
 	return ts;
 }
 EXPORT_SYMBOL(simple_inode_init_ts);
+
+static inline struct dentry *get_stashed_dentry(struct dentry *stashed)
+{
+	struct dentry *dentry;
+
+	rcu_read_lock();
+	dentry = READ_ONCE(stashed);
+	if (!dentry || !lockref_get_not_dead(&dentry->d_lockref))
+		dentry = NULL;
+	rcu_read_unlock();
+	return dentry;
+}
+
+static struct dentry *stash_dentry(struct dentry **stashed, unsigned long ino,
+				   struct super_block *sb,
+				   const struct file_operations *fops,
+				   void *data)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	dentry = d_alloc_anon(sb);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	inode = new_inode_pseudo(sb);
+	if (!inode) {
+		dput(dentry);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	inode->i_ino = ino;
+	inode->i_flags |= S_IMMUTABLE;
+	inode->i_mode = S_IFREG | S_IRUGO;
+	inode->i_fop = fops;
+	inode->i_private = data;
+	simple_inode_init_ts(inode);
+
+	/* @data is now owned by the fs */
+	d_instantiate(dentry, inode);
+
+	if (cmpxchg(stashed, NULL, dentry)) {
+		d_delete(dentry); /* make sure ->d_prune() does nothing */
+		dput(dentry);
+		cpu_relax();
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return dentry;
+}
+
+/*
+ * Try to retrieve stashed dentry or allocate a new one. Indicate to the caller
+ * whether we reused an existing one by returning 0 or when we added a new one
+ * by returning 1. This allows the caller to put any references. Alternative is
+ * a callback which is ugly.
+ */
+int path_from_stashed(struct dentry **stashed, unsigned long ino,
+		      struct vfsmount *mnt, const struct file_operations *fops,
+		      void *data, struct path *path)
+{
+	struct dentry *dentry;
+	int ret = 0;
+
+	dentry = get_stashed_dentry(*stashed);
+	if (dentry)
+		goto out_path;
+
+	dentry = stash_dentry(stashed, ino, mnt->mnt_sb, fops, data);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+	ret = 1;
+
+out_path:
+	path->dentry = dentry;
+	path->mnt = mntget(mnt);
+	return ret;
+}
-- 
2.43.0

>From ecf6c69f62a3b96926d1a4bb5f43dc18d6a60b0a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@xxxxxxxxxx>
Date: Sun, 18 Feb 2024 14:51:23 +0100
Subject: [PATCH 2/3] [RFC UNTESTED LIKELY STILL BROKEN] nsfs: convert to
 path_from_stashed() helper

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
 fs/nsfs.c                 | 70 +++++++++------------------------------
 include/linux/ns_common.h |  2 +-
 include/linux/proc_ns.h   |  2 +-
 3 files changed, 18 insertions(+), 56 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 34e1e3e36733..31d02fb6cb2e 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -38,7 +38,7 @@ static void ns_prune_dentry(struct dentry *dentry)
 	struct inode *inode = d_inode(dentry);
 	if (inode) {
 		struct ns_common *ns = inode->i_private;
-		atomic_long_set(&ns->stashed, 0);
+		WRITE_ONCE(ns->stashed, NULL);
 	}
 }
 
@@ -56,54 +56,6 @@ static void nsfs_evict(struct inode *inode)
 	ns->ops->put(ns);
 }
 
-static int __ns_get_path(struct path *path, struct ns_common *ns)
-{
-	struct vfsmount *mnt = nsfs_mnt;
-	struct dentry *dentry;
-	struct inode *inode;
-	unsigned long d;
-
-	rcu_read_lock();
-	d = atomic_long_read(&ns->stashed);
-	if (!d)
-		goto slow;
-	dentry = (struct dentry *)d;
-	if (!lockref_get_not_dead(&dentry->d_lockref))
-		goto slow;
-	rcu_read_unlock();
-	ns->ops->put(ns);
-got_it:
-	path->mnt = mntget(mnt);
-	path->dentry = dentry;
-	return 0;
-slow:
-	rcu_read_unlock();
-	inode = new_inode_pseudo(mnt->mnt_sb);
-	if (!inode) {
-		ns->ops->put(ns);
-		return -ENOMEM;
-	}
-	inode->i_ino = ns->inum;
-	simple_inode_init_ts(inode);
-	inode->i_flags |= S_IMMUTABLE;
-	inode->i_mode = S_IFREG | S_IRUGO;
-	inode->i_fop = &ns_file_operations;
-	inode->i_private = ns;
-
-	dentry = d_make_root(inode);	/* not the normal use, but... */
-	if (!dentry)
-		return -ENOMEM;
-	dentry->d_fsdata = (void *)ns->ops;
-	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
-	if (d) {
-		d_delete(dentry);	/* make sure ->d_prune() does nothing */
-		dput(dentry);
-		cpu_relax();
-		return -EAGAIN;
-	}
-	goto got_it;
-}
-
 int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 		     void *private_data)
 {
@@ -113,10 +65,16 @@ int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 		struct ns_common *ns = ns_get_cb(private_data);
 		if (!ns)
 			return -ENOENT;
-		ret = __ns_get_path(path, ns);
+		ret = path_from_stashed(&ns->stashed, ns->inum, nsfs_mnt,
+					&ns_file_operations, ns, path);
+		if (!ret || ret != -EAGAIN)
+			ns->ops->put(ns);
 	} while (ret == -EAGAIN);
 
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return 0;
 }
 
 struct ns_get_path_task_args {
@@ -163,10 +121,13 @@ int open_related_ns(struct ns_common *ns,
 			return PTR_ERR(relative);
 		}
 
-		err = __ns_get_path(&path, relative);
+		err = path_from_stashed(&ns->stashed, ns->inum, nsfs_mnt,
+					&ns_file_operations, ns, &path);
+		if (!err || err != -EAGAIN)
+			ns->ops->put(ns);
 	} while (err == -EAGAIN);
 
-	if (err) {
+	if (err < 0) {
 		put_unused_fd(fd);
 		return err;
 	}
@@ -249,7 +210,8 @@ bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
 static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
 {
 	struct inode *inode = d_inode(dentry);
-	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;
+	const struct ns_common *ns = inode->i_private;
+	const struct proc_ns_operations *ns_ops = ns->ops;
 
 	seq_printf(seq, "%s:[%lu]", ns_ops->name, inode->i_ino);
 	return 0;
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 0f1d024bd958..7d22ea50b098 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -7,7 +7,7 @@
 struct proc_ns_operations;
 
 struct ns_common {
-	atomic_long_t stashed;
+	struct dentry *stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
 	refcount_t count;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 49539bc416ce..5ea470eb4d76 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -66,7 +66,7 @@ static inline void proc_free_inum(unsigned int inum) {}
 
 static inline int ns_alloc_inum(struct ns_common *ns)
 {
-	atomic_long_set(&ns->stashed, 0);
+	WRITE_ONCE(ns->stashed, NULL);
 	return proc_alloc_inum(&ns->inum);
 }
 
-- 
2.43.0

>From 9bd2f66776f06621ae4a71d511615272971ef293 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@xxxxxxxxxx>
Date: Sun, 18 Feb 2024 14:52:24 +0100
Subject: [PATCH 3/3] [RFC UNTESTED LIKELY STILL BROKEN] pidfds: convert to
 path_from_stashed() helper

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
 fs/pidfdfs.c        | 43 ++++++++++++++++++++++++++-----------------
 include/linux/pid.h |  1 +
 kernel/pid.c        |  1 +
 3 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index be4e74cec8b9..3e1204553ef2 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -14,6 +14,8 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+#include "internal.h"
+
 struct pid *pidfd_pid(const struct file *file)
 {
 	if (file->f_op != &pidfd_fops)
@@ -161,9 +163,21 @@ static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
 			     d_inode(dentry)->i_ino);
 }
 
+static void pidfdfs_prune_dentry(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	inode = d_inode(dentry);
+	if (inode) {
+		struct pid *pid = inode->i_private;
+		WRITE_ONCE(pid->stashed, NULL);
+	}
+}
+
 const struct dentry_operations pidfdfs_dentry_operations = {
 	.d_delete	= always_delete_dentry,
 	.d_dname	= pidfdfs_dname,
+	.d_prune	= pidfdfs_prune_dentry,
 };
 
 static int pidfdfs_init_fs_context(struct fs_context *fc)
@@ -188,27 +202,22 @@ static struct file_system_type pidfdfs_type = {
 struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
 {
 
-	struct inode *inode;
 	struct file *pidfd_file;
+	struct path path;
+	int ret;
 
-	inode = iget_locked(pidfdfs_sb, pid->ino);
-	if (!inode)
-		return ERR_PTR(-ENOMEM);
-
-	if (inode->i_state & I_NEW) {
-		inode->i_ino = pid->ino;
-		inode->i_mode = S_IFREG | S_IRUGO;
-		inode->i_fop = &pidfd_fops;
-		inode->i_flags |= S_IMMUTABLE;
-		inode->i_private = get_pid(pid);
-		simple_inode_init_ts(inode);
-		unlock_new_inode(inode);
-	}
+	do {
+		ret = path_from_stashed(&pid->stashed, pid->ino, pidfdfs_mnt,
+					&pidfd_fops, pid, &path);
+	} while (ret == -EAGAIN);
 
-	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
-	if (IS_ERR(pidfd_file))
-		iput(inode);
+	if (ret < 0)
+		return ERR_PTR(ret);
 
+	if (ret)
+		get_pid(pid);
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
 	return pidfd_file;
 }
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7b6f5deab36a..3d1e817a809f 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -56,6 +56,7 @@ struct pid
 	unsigned int level;
 	spinlock_t lock;
 #ifdef CONFIG_FS_PIDFD
+	struct dentry *stashed;
 	unsigned long ino;
 #endif
 	/* lists of tasks that use this pid */
diff --git a/kernel/pid.c b/kernel/pid.c
index 2c0a9e8f58e2..f2f418ecf232 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -277,6 +277,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 #ifdef CONFIG_FS_PIDFD
+	pid->stashed = NULL;
 	pid->ino = ++pidfdfs_ino;
 #endif
 	for ( ; upid >= pid->numbers; --upid) {
-- 
2.43.0


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux