On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote: > On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > Ok, that turned out to be simpler than I had evisioned - unless I made > > horrible mistakes: > > Hmm. Could we do the same for nsfs? Jein (Ja/Yes + Nein/No). We could but it would alter nsfs quite a bit. Before stashing nsfs did (roughly): /* get an unhashed dentry */ dentry = d_alloc_pseudo(sb, &qname); /* get new or find existing inode */ inode = iget_locked(sb, ns->inum); /* Add new or reuse existing dentry atomically */ d_instantiate_unique(dentry, inode); Afaict, that was intended to work because d_instantiate_unique() existed. But I don't think it ever worked before the stashing. I'll get to that below. d_instantiate_unique() used to to find an existing dentry or added a new dentry all with inode->i_lock held. A little after that stashing mechanism for nsfs was added d_instantiate_unique() was split into two: d_exact_alias() and d_splice_alias(). The semantics are different. d_exact_alias() discounts hashed dentries - returns NULL - and only considers unhashed dentries eligible for reuse. If it finds an unhashed alias it will rehash it and return it. And whereas d_instantiate_unique() found and added a new dentry with inode->i_lock held and thus guaranteed that there would only be one dentry the d_exact_alias() followed by d_splice_alias() doesn't because i_lock is dropped in between of course. But even the logic back then didn't work. Because back then nsfs called d_alloc_pseudo() which sets dentry->d_parent = dentry. IOW, creates IS_ROOT() dentries. But __d_instantiate_unique() did: if (alias->d_parent != entry->d_parent) continue; and so would reliably discount d_alloc_pseudo() dentries... So even back then nsfs created aliases for each open of the same namespace. Right now the same problem plus another problem would exist. Consider: /* get an unhashed dentry */ dentry = d_alloc_anon(sb, &qname); /* get new or find existing inode */ inode = iget_locked(sb, ns->inum); /* Add new or reuse existing dentry atomically */ alias = d_exact_alias(dentry, inode); if (!alias) d_splice_alias(inode, dentry); That doesn't work. First, because of the inode->i_lock drop in between d_exact_alias() and d_splice_alias(). Second, because d_exact_alias() discounts IS_ROOT() for reuse. Third, because d_exact_alias() discounts hashed dentries for reuse. Consequently switching to a similar iget_locked() mechanism for nsfs as for pidfdfs right now would mean that there's now not just a single dentry that's reused by all concurrent __ns_get_path() calls. Instead one does get multiple dentries like in the old (likely broken) scheme. For nsfs this might matter because in contrast to anonymous inodes and pidfds these namespace fds can be bind-mounted and thus persisted. And in that case it's nice if you only have a single dentry. Whether that matters in practice in terms of memory I'm not sure. It likely doesn't. Now, for pidfds we don't care. pidfds can never be bind-mounted and there's no reason for that. That doesn't work with anonymous inodes and it doesn't work with pidfdfs. The pidfds_mnt is marked as vfs internal preventing it from being used for mounting. For pidfds we also don't care about multiple dentries for the same inode. Because right now - with pidfds being anonymous inodes - that's exacty what already happens. And that really hasn't been an issue so far so I don't see why it would be an issue now. But, if we wanted this we could solve it without that stashing mechanism as a patch on top of both pidfdfs and nsfs later but it would require some dcache help. I think what we'd need - specifically tailored to both nsfs and pidfdfs - is something like the below. So, this is only expected to work on stuff that does d_alloc_anon(). So has no real name and dentry->d_parent == dentry; basically a bunch of IS_ROOT() dentries. I can add that on top and then pidfds can use that and nsfs as well. But again, I don't need it for pidfdfs to be functional. Up to you. diff --git a/fs/dcache.c b/fs/dcache.c index b813528fb147..7c78b8b1a8b3 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -2642,6 +2642,39 @@ void d_add(struct dentry *entry, struct inode *inode) } EXPORT_SYMBOL(d_add); +/* + * Helper for special filesystems that want to recycle the exact same dentry + * over and over. Requires d_alloc_anon() and will reject anything that isn't + * IS_ROOT(). Caller must provide valid inode. + */ +struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode) +{ + struct dentry *alias; + unsigned int hash = entry->d_name.hash; + + if (!inode) + return NULL; + + if (!IS_ROOT(entry)) + return NULL; + + spin_lock(&inode->i_lock); + hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) { + if (alias->d_name.hash != hash) + continue; + if (!d_same_name(alias, entry->d_parent, &entry->d_name)) + continue; + dget_dlock(alias); + spin_unlock(&inode->i_lock); + return alias; + } + + __d_instantiate(entry, inode); + spin_unlock(&inode->i_lock); + security_d_instantiate(entry, inode); + return NULL; +} + /** * d_exact_alias - find and hash an exact unhashed alias * @entry: dentry to add diff --git a/fs/internal.h b/fs/internal.h index b67406435fc0..41b441c7b2a0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -310,3 +310,4 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns); struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap); void mnt_idmap_put(struct mnt_idmap *idmap); +struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode); And then this can become roughly: struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags) { struct path path; struct dentry *dentry, *alias; struct inode *inode; struct file *pidfd_file; dentry = d_alloc_anon(pidfdfs_sb); if (!dentry) return ERR_PTR(-ENOMEM); inode = iget_locked(pidfdfs_sb, pid->ino); if (!inode) { dput(dentry); return ERR_PTR(-ENOMEM); } if (inode->i_state & I_NEW) { inode->i_ino = pid->ino; inode->i_mode = S_IFREG | S_IRUGO; inode->i_fop = &pidfd_fops; inode->i_flags |= S_IMMUTABLE; inode->i_private = get_pid(pid); simple_inode_init_ts(inode); unlock_new_inode(inode); } alias = d_instantiate_unique_anon(dentry, inode); if (alias) { dput(dentry); dentry = alias; } path.dentry = dentry; path.mnt = mntget(pidfdfs_mnt); pidfd_file = dentry_open(&path, flags, current_cred()); path_put(&path); return pidfd_file; } > > Also, quick question: > > > +void pid_init_pidfdfs(struct pid *pid) > > +{ > > + pid->ino = ++pidfdfs_ino; > > +} > > As far as I can tell, the above is only safe because it is serialized by > > spin_lock_irq(&pidmap_lock); Yes, I'm explicitly relying on that because that thing serializes all alloc_pid() calls anyway which came in very handy nice. > > in the only caller. > > Can we please just not have a function at all, and just move it there, > so that it's a whole lot more obvious that that inode generation > really gets us a unique number? Yes, of course. I just did it to avoid an ifdef basically. > > Am I missing something? Nope.