Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Nov 25, 2023 at 10:01:36PM +0000, Al Viro wrote:
> On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:
> 
> > ack. I'll base the other changes we discussed on top of your branch.
> 
> Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
> and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate

FWIW, ->d_revalidate() has an old unpleasant problem we might try to solve
now.

In non-RCU mode We treat 0 as "invalidate that sucker and do a fresh lookup".
Fine, except that there's a possibility of race here - we'd hit ->d_revalidate()
while another thread was renaming object in question.  Or has just found it
by doing lookup in a place where it had been moved on server.

->d_revalidate() decides that it needs to be looked up on server and forms
a request before rename succeeds.  So NFS (e.g.) request goes out with the
old parent and name.  By the time server sees it, RENAME has been processed
and succeeded.  There's no such file in the old place anymore.

So ->d_revalidate() returns 0... and we proceed to invalidate the dentry.
Which had been moved to *new* place by now.  In that place it's perfectly
valid and does not deserve invalidation.

Scenario when rename had been done not from this client is even worse:

server:/srv/nfs/foo is mounted on /mnt/foo
we state /mnt/foo/bar
/mnt/foo/bar is in dcache
somebody on server renames /srv/nfs/foo/bar to /srv/nfs/foo/barf
process A: stat /mnt/foo/bar/baz.
process B: mount something on /mnt/foo/barf/
process B: no /mnt/foo/barf in dcache, let's look it up
	   found fhandle of /mnt/foo
	   sent LOOKUP "barf" in it
	   got an fhandle and found it matching the inode of /mnt/foo/bar
process A: has reached /mnt/foo/bar and decided to revalidate it.
	   found fhandle of /mnt/foo
	   sent a LOOKUP "bar" in that
	   got "nothing with that name there"
	   ->d_revalidate() returns 0
	   loses CPU
process B: splice the dentry of /mnt/foo/bar to /mnt/foo/barf
	   proceed to mount on top of it
process A: gets CPU back
	   calls d_invalidate() on the dentry that now is /mnt/foo/barf
	   dissolves the mount created by process B

Note that server:/srv/nfs/foo/barf has been there and perfectly valid
since before B has started doing anything.  It has no idea that the
damn thing used to be in a different place and something on the same
client had seen it at the old place once upon a time.  As far as it is
concerned, mount has succeeded and then quietly disappeared.  The mountpoint
is still there - with freshly looked up dentry, since the old one had been
invalidated, but userland doesn't see that, so... WTF?

It's not easy to hit, but I'd expect it to be feasible on SMP KVM, where instead
of A losing CPU we might've had the virtual CPU losing the timeslice on host.

IMO we should only do d_invalidate() if
	* ->d_revalidate() has returned 0
	* dentry is still hashed, still has the same parent and still matches
the name from ->d_compare() POV.
If it doesn't, we should just leave it whereever it has been moved to and
act as if we hadn't seen it in the first place.

In other words, have
d_revalidate(dentry, parent, name, flags) doing the following:
	if no ->d_revalidate
		return 1
	ret = ->d_revalidate(...)
	if (unlikely(ret == 0) && !(flags & LOOKUP_RCU)) {
		spin_lock(&dentry->d_lock);
		if (!d_same_name(dentry, parent, name))
			spin_lock(&dentry->d_lock);
		else
			d_invalidate_locked(dentry);
	}
	return ret

where d_invalidate_locked() would be d_invalidate() sans the initial
spin_lock(&dentry->d_lock);

That would solve that problem, AFAICS.  Objections, anyone?  I'm too
sleepy to put together a patch at the moment, will post after I get
some sleep...

PS: as the matter of fact, it might be a good idea to pass the parent
as explicit argument to ->d_revalidate(), now that we are passing the
name as well.  Look at the boilerplate in the instances; all that
        parent = READ_ONCE(dentry->d_parent);
	dir = d_inode_rcu(parent);
	if (!dir)
		return -ECHILD;
	...
on the RCU side combined with
	parent = dget_parent(dentry);
	dir = d_inode(parent);
	...
	dput(dir);
stuff.

It's needed only because the caller had not told us which directory
is that thing supposed to be in; in non-RCU mode the parent is
explicitly pinned down, no need to play those games.  All we need
is
	dir = d_inode_rcu(parent);
	if (!dir) // could happen only in RCU mode
		return -ECHILD;
assuming we need the parent inode, that is.

So... how about
	int (*d_revalidate)(struct dentry *dentry, struct dentry *parent,
			  const struct qstr *name, unsigned int flags);
since we are touching all instances anyway?




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux