Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

"NeilBrown" <neilb@xxxxxxx> · Tue, 18 Apr 2023 07:26:34 +1000

On Mon, 17 Apr 2023, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > 
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all.  Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> > 
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access.  It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed.  We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient.  However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted.  When "-f" is needed, we really need to
> 
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.

Possibly I could have been more clear.
If /foo is the mount point for a filesystem that is causing problems and
/foo/bar is the mount point of a different filesystem, which might have
other problems, then I was saying that the simple solution of
   umount -l /foo
which would MNT_DETACH both filesystems is not ideal because there might
be pending IO that will never complete, so the mount can never be
cleaned up.  In such cases we really want to be able to do
   umount -f /foo/bar
and ensure that succeeds.
You can see now that it isn't that MNT_FORCE precludes MNT_DETACH, but
they would be done on different filesystems.

MNT_FORCE is, I think, a good idea and a needed functionality that has
never been implemented well.
MNT_FORCE causes nfs_umount_begin to be called as you noted, which
aborts all pending RPCs on that filesystem.
This might be useful if you have an "ls" running, as it is likely to
abort on the first getdents() failure.  But if you have "ls -l' running,
it will likely just go and try another lstat(), and that is just as
likely to hang - thus still preventing the umount.

What we used to do was find anything using the mount and kill it
(sometime "fuser -k" would work).  These processes would likely be stuck
in a D wait, but "umount -f" would abort the RPCs and the processes
would get unstuck and could respond to the signal.  This would sometimes
take a couple of iterations.

These days we have TASK_KILLABLE so this is much less likely to be
needed - we just kill all the processes and they die promptly.  Most of
the time.  There are a few places that can still block, but which no-one
has had the motivation to fix.

It would be much nicer if we didn't need to kill things.  Even with
"fuser -k" it isn't really the sort of thing you want in a shutdown
script (or in systemd-shutdown).

It would be ideal if the shutdown process could call "umount" and if
that fails with EBUSY, call "umount -f" and be confident of ....
something.  Currently "-f" might inspire hope, but not confidence.

We cannot realistically make MNT_FORCE truly force a umount because
processes really need to die before that happens.  But we could ensure
that the filesystem is quiescent and stays that way.

Maybe we could add a flag to 'struct vfsmount' to say that "unmount has
been forced".  Any attempt to use a fd with such a vfsmnt would fail,
expect close() and anything similar.  Maybe nfs_umount_begin could
iterate all open files on that vfsmnt and purge any cached data so no
background writes were needed.  I think this would be very close to what
we needed.

Then systemd could run umount() and if that failed then
umount(MNT_FORCE) and if that fails umount(MNT_DETACH), with confidence
that there would be not more RPCs generates, so it would be safe to turn
off the network.

BTW, that is another reason that just doing "umount -l /foo" is not
ideal.  We sometimes want to wait for the umount to complete, so we can
remove the backing store or the network.  "umount -l /foo" doesn't let
us wait.  "umount /foo/bar" does.

> 
> > be able to name the target filesystem.
> > 
> > We NEVER want to revalidate anything.  We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> > 
> > Permission checks are a little less clear.  We currently allow any user
> 
> This is all very brittle.
> 
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
> 
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
> 
> The LSM hooks are also called twice and would now also be called once.

I'd prefer "called never"

> 
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().

I didn't pay proper attention to acl_permission() - you are right.

> 
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
> 
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
> 
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.

I was really hoping that existing code could be made to just work.

Thanks for the review.

NeilBrown

> 
> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found.  If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> > 
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> > 
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem.  We can only be certain of a safe unmount if
> > a direct path is used.
> > 
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> >   *
> >   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> >   */
> > -int inode_permission(struct mnt_idmap *idmap,
> > -		     struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > +			struct inode *inode, int mask, bool mp)
> >  {
> >  	int retval;
> >  
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> >  			return -EACCES;
> >  	}
> >  
> > -	retval = do_inode_permission(idmap, inode, mask);
> > +	if (mp)
> > +		/* We are looking for a mountpoint and so don't
> > +		 * want to interact with the filesystem at all, just
> > +		 * the dcache and icache.
> > +		 */
> > +		retval = generic_permission(idmap, inode, mask);
> > +	else
> > +		retval = do_inode_permission(idmap, inode, mask);
> >  	if (retval)
> >  		return retval;
> >  
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >  
> >  	return security_inode_permission(inode, mask);
> >  }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @inode:	Inode to check permission on
> > + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > +		     struct inode *inode, int mask)
> > +{
> > +	return inode_permission_mp(idmap, inode, mask, false);
> > +}
> >  EXPORT_SYMBOL(inode_permission);
> >  
> >  /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> >  #define ND_ROOT_PRESET 1
> >  #define ND_ROOT_GRABBED 2
> >  #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >  
> >  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> >  {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >  
> >  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> >  {
> > -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
> >  		return dentry->d_op->d_revalidate(dentry, flags);
> >  	else
> >  		return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> >  static inline int may_lookup(struct mnt_idmap *idmap,
> >  			     struct nameidata *nd)
> >  {
> > +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> > +	 * capability, then we will by-pass the filesys for permission checks
> > +	 * and just use generic_permission().
> > +	 */
> > +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> >  	if (nd->flags & LOOKUP_RCU) {
> > -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> >  		if (err != -ECHILD || !try_to_unlazy(nd))
> >  			return err;
> >  	}
> > -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> > +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> >  }
> >  
> >  static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> >  	if (IS_ERR(name))
> >  		return PTR_ERR(name);
> >  	set_nameidata(&nd, dfd, name, root);
> > +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > +		nd.state = ND_SYS_ADMIN;
> >  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> >  	if (unlikely(retval == -ECHILD))
> >  		retval = path_lookupat(&nd, flags, path);
> > -- 
> > 2.40.0
> > 
>