Re: [PATCH 2/2] pidfd: add pidfdfs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > Ok, that turned out to be simpler than I had evisioned - unless I made
> > horrible mistakes:
> 
> Hmm. Could we do the same for nsfs?

Jein (Ja/Yes + Nein/No).
We could but it would alter nsfs quite a bit.

Before stashing nsfs did (roughly):

/* get an unhashed dentry */
dentry = d_alloc_pseudo(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
d_instantiate_unique(dentry, inode);

Afaict, that was intended to work because d_instantiate_unique()
existed. But I don't think it ever worked before the stashing. I'll get
to that below.

d_instantiate_unique() used to to find an existing dentry or added a new
dentry all with inode->i_lock held.

A little after that stashing mechanism for nsfs was added
d_instantiate_unique() was split into two: d_exact_alias() and
d_splice_alias(). The semantics are different. d_exact_alias() discounts
hashed dentries - returns NULL - and only considers unhashed dentries
eligible for reuse. If it finds an unhashed alias it will rehash it and
return it.

And whereas d_instantiate_unique() found and added a new dentry with
inode->i_lock held and thus guaranteed that there would only be one
dentry the d_exact_alias() followed by d_splice_alias() doesn't because
i_lock is dropped in between of course.

But even the logic back then didn't work. Because back then nsfs called
d_alloc_pseudo() which sets dentry->d_parent = dentry. IOW, creates
IS_ROOT() dentries. But __d_instantiate_unique() did:

if (alias->d_parent != entry->d_parent)
        continue;

and so would reliably discount d_alloc_pseudo() dentries... So even back
then nsfs created aliases for each open of the same namespace.

Right now the same problem plus another problem would exist. Consider:

/* get an unhashed dentry */
dentry = d_alloc_anon(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
alias = d_exact_alias(dentry, inode);
if (!alias)
	d_splice_alias(inode, dentry);

That doesn't work. First, because of the inode->i_lock drop in between
d_exact_alias() and d_splice_alias(). Second, because d_exact_alias()
discounts IS_ROOT() for reuse. Third, because d_exact_alias() discounts
hashed dentries for reuse.

Consequently switching to a similar iget_locked() mechanism for nsfs as
for pidfdfs right now would mean that there's now not just a single
dentry that's reused by all concurrent __ns_get_path() calls. Instead
one does get multiple dentries like in the old (likely broken) scheme.

For nsfs this might matter because in contrast to anonymous inodes and
pidfds these namespace fds can be bind-mounted and thus persisted. And
in that case it's nice if you only have a single dentry. Whether that
matters in practice in terms of memory I'm not sure. It likely doesn't.

Now, for pidfds we don't care. pidfds can never be bind-mounted and
there's no reason for that. That doesn't work with anonymous inodes and
it doesn't work with pidfdfs. The pidfds_mnt is marked as vfs internal
preventing it from being used for mounting.

For pidfds we also don't care about multiple dentries for the same
inode. Because right now - with pidfds being anonymous inodes - that's
exacty what already happens. And that really hasn't been an issue so far
so I don't see why it would be an issue now.

But, if we wanted this we could solve it without that stashing mechanism
as a patch on top of both pidfdfs and nsfs later but it would require
some dcache help.

I think what we'd need - specifically tailored to both nsfs and pidfdfs
- is something like the below. So, this is only expected to work on
stuff that does d_alloc_anon(). So has no real name and dentry->d_parent
== dentry; basically a bunch of IS_ROOT() dentries. I can add that on
top and then pidfds can use that and nsfs as well. But again, I don't
need it for pidfdfs to be functional. Up to you.

diff --git a/fs/dcache.c b/fs/dcache.c
index b813528fb147..7c78b8b1a8b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2642,6 +2642,39 @@ void d_add(struct dentry *entry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_add);

+/*
+ * Helper for special filesystems that want to recycle the exact same dentry
+ * over and over. Requires d_alloc_anon() and will reject anything that isn't
+ * IS_ROOT(). Caller must provide valid inode.
+ */
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode)
+{
+       struct dentry *alias;
+       unsigned int hash = entry->d_name.hash;
+
+       if (!inode)
+               return NULL;
+
+       if (!IS_ROOT(entry))
+               return NULL;
+
+       spin_lock(&inode->i_lock);
+       hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+               if (alias->d_name.hash != hash)
+                       continue;
+               if (!d_same_name(alias, entry->d_parent, &entry->d_name))
+                       continue;
+               dget_dlock(alias);
+               spin_unlock(&inode->i_lock);
+               return alias;
+       }
+
+       __d_instantiate(entry, inode);
+       spin_unlock(&inode->i_lock);
+       security_d_instantiate(entry, inode);
+       return NULL;
+}
+
 /**
  * d_exact_alias - find and hash an exact unhashed alias
  * @entry: dentry to add
diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..41b441c7b2a0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,4 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode);

And then this can become roughly:

struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
{
        struct path path;
        struct dentry *dentry, *alias;
        struct inode *inode;
        struct file *pidfd_file;

        dentry = d_alloc_anon(pidfdfs_sb);
        if (!dentry)
                return ERR_PTR(-ENOMEM);

        inode = iget_locked(pidfdfs_sb, pid->ino);
        if (!inode) {
                dput(dentry);
                return ERR_PTR(-ENOMEM);
        }

        if (inode->i_state & I_NEW) {
                inode->i_ino = pid->ino;
                inode->i_mode = S_IFREG | S_IRUGO;
                inode->i_fop = &pidfd_fops;
                inode->i_flags |= S_IMMUTABLE;
                inode->i_private = get_pid(pid);
                simple_inode_init_ts(inode);
                unlock_new_inode(inode);
        }

        alias = d_instantiate_unique_anon(dentry, inode);
        if (alias) {
                dput(dentry);
                dentry = alias;
        }

        path.dentry = dentry;
        path.mnt = mntget(pidfdfs_mnt);

        pidfd_file = dentry_open(&path, flags, current_cred());
        path_put(&path);
        return pidfd_file;
}

> 
> Also, quick question:
> 
> > +void pid_init_pidfdfs(struct pid *pid)
> > +{
> > +       pid->ino = ++pidfdfs_ino;
> > +}
> 
> As far as I can tell, the above is only safe because it is serialized by
> 
>         spin_lock_irq(&pidmap_lock);

Yes, I'm explicitly relying on that because that thing serializes all
alloc_pid() calls anyway which came in very handy nice.

> 
> in the only caller.
> 
> Can we please just not have a function at all, and just move it there,
> so that it's a whole lot more obvious that that inode generation
> really gets us a unique number?

Yes, of course. I just did it to avoid an ifdef basically.

> 
> Am I missing something?

Nope.




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux