Re: Do we really need d_weak_revalidate???

NeilBrown <neilb@xxxxxxxx> · Thu, 17 Aug 2017 09:47:38 +1000

On Wed, Aug 16 2017, Jeff Layton wrote:

> On Wed, 2017-08-16 at 12:43 +1000, NeilBrown wrote:
>> On Mon, Aug 14 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-08-14 at 09:36 +1000, NeilBrown wrote:
>> > > On Fri, Aug 11 2017, Jeff Layton wrote:
>> > > 
>> > > > On Fri, 2017-08-11 at 05:55 +0000, Trond Myklebust wrote:
>> > > > > On Fri, 2017-08-11 at 14:31 +1000, NeilBrown wrote:
>> > > > > > Funny story.  4.5 years ago we discarded the FS_REVAL_DOT superblock
>> > > > > > flag and introduced the d_weak_revalidate dentry operation instead.
>> > > > > > We duly removed the flag from NFS superblocks and NFSv4 superblocks,
>> > > > > > and added the new dentry operation to NFS dentries .... but not to
>> > > > > > NFSv4
>> > > > > > dentries.
>> > > > > > 
>> > > > > > And nobody noticed.
>> > > > > > 
>> > > > > > Until today.
>> > > > > > 
>> > > > > > A customer reports a situation where mount(....,MS_REMOUNT,..) on an
>> > > > > > NFS
>> > > > > > filesystem hangs because the network has been deconfigured.  This
>> > > > > > makes
>> > > > > > perfect sense and I suggested a code change to fix the problem.
>> > > > > > However when a colleague was trying to reproduce the problem to
>> > > > > > validate
>> > > > > > the fix, he couldn't.  Then nor could I.
>> > > > > > 
>> > > > > > The problem is trivially reproducible with NFSv3, and not at all with
>> > > > > > NFSv4.  The reason is the missing d_weak_revalidate.
>> > > > > > 
>> > > > > > We could simply add d_weak_revalidate for NFSv4, but given that it
>> > > > > > has been missing for 4.5 years, and the only time anyone noticed was
>> > > > > > when the ommission resulted in a better user experience, I do wonder
>> > > > > > if
>> > > > > > we need to.  Can we just discard d_weak_revalidate?  What purpose
>> > > > > > does
>> > > > > > it serve?  I couldn't find one.
>> > > > > > 
>> > > > > > Thanks,
>> > > > > > NeilBrown
>> > > > > > 
>> > > > > > For reference, see
>> > > > > > Commit: ecf3d1f1aa74 ("vfs: kill FS_REVAL_DOT by adding a
>> > > > > > d_weak_revalidate dentry op")
>> > > > > > 
>> > > > > > 
>> > > > > > 
>> > > > > > To reproduce the problem at home, on a system that uses systemd:
>> > > > > > 1/ place (or find) a filesystem image in a file on an NFS filesystem.
>> > > > > > 2/ mount the nfs filesystem with "noac" - choose v3 or v4
>> > > > > > 3/ loop-mount the filesystem image read-only somewhere
>> > > > > > 4/ reboot
>> > > > > > 
>> > > > > > If you choose v4, the reboot will succeed, possibly after a 90second
>> > > > > > timeout.
>> > > > > > If you choose v3, the reboot will hang indefinitely in systemd-
>> > > > > > shutdown while
>> > > > > > remounting the nfs filesystem read-only.
>> > > > > > 
>> > > > > > If you don't use "noac" it can still hang, but only if something
>> > > > > > slows
>> > > > > > down the reboot enough that attributes have timed out by the time
>> > > > > > that
>> > > > > > systemd-shutdown runs.  This happens for our customer.
>> > > > > > 
>> > > > > > If the loop-mounted filesystem is not read-only, you get other
>> > > > > > problems.
>> > > > > > 
>> > > > > > We really want systemd to figure out that the loop-mount needs to be
>> > > > > > unmounted first.  I have ideas concerning that, but it is messy.  But
>> > > > > > that isn't the only bug here.
>> > > > > 
>> > > > > The main purpose of d_weak_revalidate() was to catch the issues that
>> > > > > arise when someone changes the contents of the current working
>> > > > > directory or its parent on the server. Since '.' and '..' are treated
>> > > > > specially in the lookup code, they would not be revalidated without
>> > > > > special treatment. That leads to issues when looking up files as
>> > > > > ./<filename> or ../<filename>, since the client won't detect that its
>> > > > > dcache is stale until it tries to use the cached dentry+inode.
>> > > > > 
>> > > > > The one thing that has changed since its introduction is, I believe,
>> > > > > the ESTALE handling in the VFS layer. That might fix a lot of the
>> > > > > dcache lookup bugs that were previously handled by d_weak_revalidate().
>> > > > > I haven't done an audit to figure out if it actually can handle all of
>> > > > > them.
>> > > > > 
>> > > > 
>> > > > It may also be related to 8033426e6bdb2690d302872ac1e1fadaec1a5581:
>> > > > 
>> > > >     vfs: allow umount to handle mountpoints without revalidating them
>> > > 
>> > > You say in the comment for that commit:
>> > > 
>> > >      but there
>> > >     are cases where we do want to revalidate the root of the fs.
>> > > 
>> > > Do you happen to remember what those cases are?
>> > > 
>> > 
>> > Not exactly, but I _think_ I might have been assuming that we needed to
>> > ensure that the inode attrs on the root were up to date after the
>> > pathwalk.
>> > 
>> > I think that was probably wrong. d_revalidate is really intended to
>> > ensure that the dentry in question still points to the same inode. In
>> > the case of the root of the mount though, we don't really care about the
>> > dentry on the server at all. We're attaching the root of the mount to an
>> > inode and don't care of the dentry name changes. If we do need to ensure
>> > the inode attrs are updated, we'll just revalidate them at that point.
>> > 
>> > > > 
>> > > > Possibly the fact that we no longer try to revalidate during unmount
>> > > > means that this is no longer necessary?
>> > > > 
>> > > > The original patch that added d_weak_revalidate had a reproducer in the
>> > > > patch description. Have you verified that that problem is still not
>> > > > reproducible when you remove d_weak_revalidate?
>> > > 
>> > > I did try the reproducer and it works as expected both with and without
>> > > d_weak_revalidate.
>> > > On reflection, the problem it displayed was caused by d_revalidate()
>> > > being called when the dentry name was irrelevant.  We remove that
>> > > (fixing the problem) and introduce d_weak_revalidate because we thought
>> > > that minimum functionality was still useful.  I'm currently not
>> > > convinced that even that is needed.
>> > > 
>> > > If we discarded d_weak_revalidate(), we could get rid of the special
>> > > handling of umount....
>> > 
>> > I like idea. I say go for it and we can see what (if anything) breaks?
>> 
>> Getting rid of d_weak_revalidate is easy enough - hardly any users.
>> 
>> Getting rid of filename_mountpoint() isn't so easy unfortunately.
>> autofs4 uses kern_path_mountpoint() - presumably to avoid getting stuck
>> in autofs4_d_manage()?  It would be a shame to keep this infrastructure
>> around just so that one part of autofs4 can talk to another part of
>> autofs4.
>> 
>> Do you know if the fact that filename_mountpoint() skips ->d_manage is
>> important for sys_umount ??
>> 
>> Thanks,
>> NeilBrown
>> 
>
> (cc'ing David and Ian)
>
> I'm less familiar with the automounting machinery, but I imagine you
> don't really want to go triggering new mounts when your intent is to
> unmount something. As long as that doesn't happen I'd think we'd be ok
> here.

New mounts don't get triggered unless asked for (LOOKUP_AUTOMOUNT) so
that isn't a problem.  But the d_manage() operation does get called, and
it can block.  I think the purpose of the interface is to avoid races
with expiring automounts.  Once the decision has been made to expire the
mount, you want to be able to stop anyone getting a new access.
d_manage() can do that.  umount() probably doesn't need d_manage() to be
skipped.

autofs *does* for certain ioctls which are used to re-establish
ownership of an automount (e.g. when restarting automountd).  The owner
never gets blocked, but if the new automountd isn't the owner yet, it
needs help.

The current code causes the lookup to never call d_manage(), which I
think is overkill.  There might be some other filesystem which uses
d_manage() (not actually possible at present) and those shouldn't be
ignored.

So I came up with this.  It isn't exactly elegant, but it should avoid
the deadlock that caused autofs4 to start using kern_path_mount().

If we apply this change, and get rid of d_weak_revalidate(), then I
think we can get rid of all the "mountpoint" pathname lookup code.

Ian/David - you do have any thoughts on this approach?

Thanks,
NeilBrown

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index beef981aa54f..478a9aafd6d3 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -271,3 +271,4 @@ static inline void autofs4_del_expiring(struct dentry *dentry)
 }
 
 void autofs4_kill_sb(struct super_block *);
+int is_autofs_finder(void);
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index dd9f1bebb5a3..f3e84bfed85a 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -200,6 +200,13 @@ static int autofs_dev_ioctl_protosubver(struct file *fp,
 	return 0;
 }
 
+static DEFINE_MUTEX(autofs_find);
+static struct task_struct *autofs_finder = NULL;
+int is_autofs_finder(void)
+{
+	return autofs_finder == current;
+}
+
 /* Find the topmost mount satisfying test() */
 static int find_autofs_mount(const char *pathname,
 			     struct path *res,
@@ -209,7 +216,12 @@ static int find_autofs_mount(const char *pathname,
 	struct path path;
 	int err;
 
-	err = kern_path_mountpoint(AT_FDCWD, pathname, &path, 0);
+	mutex_lock(&autofs_find);
+	autofs_finder = current;
+	err = kern_path(pathname, 0, &path);
+	autofs_finder = NULL;
+	mutex_unlock(&autofs_find);
+
 	if (err)
 		return err;
 	err = -ENOENT;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index d79ced925861..e93a3112d60c 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -438,7 +438,7 @@ static int autofs4_d_manage(const struct path *path, bool rcu_walk)
 	pr_debug("dentry=%p %pd\n", dentry, dentry);
 
 	/* The daemon never waits. */
-	if (autofs4_oz_mode(sbi)) {
+	if (autofs4_oz_mode(sbi) || is_autofs_finder()) {
 		if (!path_is_mountpoint(path))
 			return -EISDIR;
 		return 0;
Attachment:
signature.asc

Description: PGP signature