On Sat, Nov 25, 2023 at 10:01:36PM +0000, Al Viro wrote: > On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote: > > > ack. I'll base the other changes we discussed on top of your branch. > > Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly, > and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate FWIW, ->d_revalidate() has an old unpleasant problem we might try to solve now. In non-RCU mode We treat 0 as "invalidate that sucker and do a fresh lookup". Fine, except that there's a possibility of race here - we'd hit ->d_revalidate() while another thread was renaming object in question. Or has just found it by doing lookup in a place where it had been moved on server. ->d_revalidate() decides that it needs to be looked up on server and forms a request before rename succeeds. So NFS (e.g.) request goes out with the old parent and name. By the time server sees it, RENAME has been processed and succeeded. There's no such file in the old place anymore. So ->d_revalidate() returns 0... and we proceed to invalidate the dentry. Which had been moved to *new* place by now. In that place it's perfectly valid and does not deserve invalidation. Scenario when rename had been done not from this client is even worse: server:/srv/nfs/foo is mounted on /mnt/foo we state /mnt/foo/bar /mnt/foo/bar is in dcache somebody on server renames /srv/nfs/foo/bar to /srv/nfs/foo/barf process A: stat /mnt/foo/bar/baz. process B: mount something on /mnt/foo/barf/ process B: no /mnt/foo/barf in dcache, let's look it up found fhandle of /mnt/foo sent LOOKUP "barf" in it got an fhandle and found it matching the inode of /mnt/foo/bar process A: has reached /mnt/foo/bar and decided to revalidate it. found fhandle of /mnt/foo sent a LOOKUP "bar" in that got "nothing with that name there" ->d_revalidate() returns 0 loses CPU process B: splice the dentry of /mnt/foo/bar to /mnt/foo/barf proceed to mount on top of it process A: gets CPU back calls d_invalidate() on the dentry that now is /mnt/foo/barf dissolves the mount created by process B Note that server:/srv/nfs/foo/barf has been there and perfectly valid since before B has started doing anything. It has no idea that the damn thing used to be in a different place and something on the same client had seen it at the old place once upon a time. As far as it is concerned, mount has succeeded and then quietly disappeared. The mountpoint is still there - with freshly looked up dentry, since the old one had been invalidated, but userland doesn't see that, so... WTF? It's not easy to hit, but I'd expect it to be feasible on SMP KVM, where instead of A losing CPU we might've had the virtual CPU losing the timeslice on host. IMO we should only do d_invalidate() if * ->d_revalidate() has returned 0 * dentry is still hashed, still has the same parent and still matches the name from ->d_compare() POV. If it doesn't, we should just leave it whereever it has been moved to and act as if we hadn't seen it in the first place. In other words, have d_revalidate(dentry, parent, name, flags) doing the following: if no ->d_revalidate return 1 ret = ->d_revalidate(...) if (unlikely(ret == 0) && !(flags & LOOKUP_RCU)) { spin_lock(&dentry->d_lock); if (!d_same_name(dentry, parent, name)) spin_lock(&dentry->d_lock); else d_invalidate_locked(dentry); } return ret where d_invalidate_locked() would be d_invalidate() sans the initial spin_lock(&dentry->d_lock); That would solve that problem, AFAICS. Objections, anyone? I'm too sleepy to put together a patch at the moment, will post after I get some sleep... PS: as the matter of fact, it might be a good idea to pass the parent as explicit argument to ->d_revalidate(), now that we are passing the name as well. Look at the boilerplate in the instances; all that parent = READ_ONCE(dentry->d_parent); dir = d_inode_rcu(parent); if (!dir) return -ECHILD; ... on the RCU side combined with parent = dget_parent(dentry); dir = d_inode(parent); ... dput(dir); stuff. It's needed only because the caller had not told us which directory is that thing supposed to be in; in non-RCU mode the parent is explicitly pinned down, no need to play those games. All we need is dir = d_inode_rcu(parent); if (!dir) // could happen only in RCU mode return -ECHILD; assuming we need the parent inode, that is. So... how about int (*d_revalidate)(struct dentry *dentry, struct dentry *parent, const struct qstr *name, unsigned int flags); since we are touching all instances anyway?