Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 17 Apr 2023 08:25:23 -0400

On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > 
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all.  Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> > 
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access.  It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed.  We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient.  However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted.  When "-f" is needed, we really need to
> 
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.
> 
> > be able to name the target filesystem.
> > 
> > We NEVER want to revalidate anything.  We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> > 
> > Permission checks are a little less clear.  We currently allow any user
> 
> This is all very brittle.
> 
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
> 
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
> 
> The LSM hooks are also called twice and would now also be called once.
> 
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().
> 
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
> 
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
> 
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.
> 

These are good points.

One way around the issues you point out might be to pass down a new
MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
filesystem driver to decide whether it wants to avoid potentially
problematic activity when checking permissions. nfs_permission could
check for that and take a more hands-off approach to the permissions
check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
think that might do what we need.

> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found.  If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> > 
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> > 
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem.  We can only be certain of a safe unmount if
> > a direct path is used.
> > 
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> >   *
> >   * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> >   */
> > -int inode_permission(struct mnt_idmap *idmap,
> > -		     struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > +			struct inode *inode, int mask, bool mp)
> >  {
> >  	int retval;
> >  
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> >  			return -EACCES;
> >  	}
> >  
> > -	retval = do_inode_permission(idmap, inode, mask);
> > +	if (mp)
> > +		/* We are looking for a mountpoint and so don't
> > +		 * want to interact with the filesystem at all, just
> > +		 * the dcache and icache.
> > +		 */
> > +		retval = generic_permission(idmap, inode, mask);
> > +	else
> > +		retval = do_inode_permission(idmap, inode, mask);
> >  	if (retval)
> >  		return retval;
> >  
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >  
> >  	return security_inode_permission(inode, mask);
> >  }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @inode:	Inode to check permission on
> > + * @mask:	Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > +		     struct inode *inode, int mask)
> > +{
> > +	return inode_permission_mp(idmap, inode, mask, false);
> > +}
> >  EXPORT_SYMBOL(inode_permission);
> >  
> >  /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> >  #define ND_ROOT_PRESET 1
> >  #define ND_ROOT_GRABBED 2
> >  #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >  
> >  static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> >  {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >  
> >  static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> >  {
> > -	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > +	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > +	    likely(!(flags & LOOKUP_MOUNTPOINT)))
> >  		return dentry->d_op->d_revalidate(dentry, flags);
> >  	else
> >  		return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> >  static inline int may_lookup(struct mnt_idmap *idmap,
> >  			     struct nameidata *nd)
> >  {
> > +	/* If we are looking for a mountpoint and we have the SYS_ADMIN
> > +	 * capability, then we will by-pass the filesys for permission checks
> > +	 * and just use generic_permission().
> > +	 */
> > +	bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> >  	if (nd->flags & LOOKUP_RCU) {
> > -		int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > +		int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> >  		if (err != -ECHILD || !try_to_unlazy(nd))
> >  			return err;
> >  	}
> > -	return inode_permission(idmap, nd->inode, MAY_EXEC);
> > +	return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> >  }
> >  
> >  static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> >  	if (IS_ERR(name))
> >  		return PTR_ERR(name);
> >  	set_nameidata(&nd, dfd, name, root);
> > +	if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > +		nd.state = ND_SYS_ADMIN;
> >  	retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> >  	if (unlikely(retval == -ECHILD))
> >  		retval = path_lookupat(&nd, flags, path);
> > -- 
> > 2.40.0
> > 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>