On Fri, 26 Jul 2013 11:29:27 +1000 Dave Chinner <david@xxxxxxxxxxxxx> wrote: > On Thu, Jul 25, 2013 at 11:49:46AM -0400, Dwight Engen wrote: > > We need to check that userspace callers can only truncate > > preallocated blocks from files they have write access to to prevent > > them from prematurley reclaiming blocks from another user. The > > internal reclaimer will not specify the internal > > XFS_KEOF_FLAGS_PERM_CHECK flag, but callers from userspace will > > have it forced on. > > > > Add check for read-only filesystem to free eofblocks ioctl. > > OK, so let me preface this by saying I understand a lot more about > user namespaces that I did a week ago... > > Firstly, this ioctls allows any user to call this function with > arbitrary uid/gid/prid values here. Regardless of user namespaces, I > think that the only UID/GID that should be allowed to be passed for > a user context is their own UID/GID. There's no point in allowing a > scan if the inode_permission()/uid/gid matching checks are always > going to fail. Hi Dave, I agree this patch isn't really about user namespaces. The reason for using inode_permission instead of just checking eofblocks->gid == current_fsgid is that the caller might want to trim an inode they have group write to but is not current_fsgid. When a caller passes in a uid/gid/projid, they are asking us to consider a smaller subset of inodes for trimming. If they don't pass in uid/gid/projid, the list of inodes is unfiltered and the inode_permission is what ensures they should be able to effect a particular inode. > Secondly, because a user can issue an arbitrary number of concurrent > scans, I'd suggest we serialise all user scans through a per-mount > mutex. i.e. there can only be one user scan in progress at a time. > For users with capable(CAP_SYS_ADMIN), we don't need to serialise > them as we let root in the init userns shoot themselves in the foot > all they want. I can add this. > Thirdly, project quotas are not really a user controlled resource > and so I don't think we want users calling this ioctl to trim > project quotas. Indeed, the files in the project that need trimming > may not even belong to the user running the scan, and hence the > user will never be allowed to trim the EOF blocks. > > Further, project quotas might underly the namespace container and be > used for limiting the disk space the namespace container uses. In > which case, we don't even want access to the project ID scanning > from within the namespace. > > Because of this, I'd suggest that project ID scans need a > "capable(CAP_SYS_ADMIN)" check on them. This means a project ID scan > can only be run from the init_userns - it cannot be done from within > the userns container at all. I don't think this would be preventing anything the user can already do: the user just doesn't specify a projid and so we will just look at all the nodes. Specifying a uid/gid/projid only filters/limits the amount of nodes we'll consider, which I think could be useful ie. the caller passes in a projid which will only trim inodes that match the projid and which the caller has inode_permission to. > ..... > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index ed35584..a80f38c 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -1247,6 +1247,10 @@ xfs_inode_free_eofblocks( > > if (!xfs_inode_match_id(ip, eofb)) > > return 0; > > > > + if ((eofb->eof_flags & XFS_KEOF_FLAGS_PERM_CHECK) > > && > > + inode_permission(VFS_I(ip), MAY_WRITE)) > > + return 0; > > Lastly, what happens if this inode_permission() call occurs in the > context of a kworker thread. What userns does an anonymous kworker > thread run in? It's the init userns, right? so: > > inode_permission(inode, MAY_WRITE); > generic_permission(inode, MAY_WRITE) > inode_capable(inode, CAP_DAC_OVERRIDE) > { > ns = current_user_ns(); > return ns_capable(ns, cap) && > kuid_has_mapping(ns, inode->i_uid); > } > > If we are running in the init_userns as a kernel thread, the > ns_capable() check will return true, and there's always a mapping in > the init namespace so kuid_has_mapping() will return true. Hence > we'll always have write permission to every inode we check, > regardless of the XFS_KEOF_FLAGS_PERM_CHECK flag. > > IOWs, this permission check is useless if we run the scan via a > workqueue. Hence we need to be *very careful* with the way we > execute scans and so this needs big, loud comments around it to > remind us to be careful in the future. Right, the XFS_KEOF_FLAGS_PERM_CHECK flag should not be on for internal contexts so the inode_permission check won't even be done (and it would pass anyway as you mention). I can add a big comment that XFS_KEOF_FLAGS_PERM_CHECK should only be used from process context because it won't work otherwise. > > +#define XFS_KEOF_FLAGS_PERM_CHECK (1 << 31) /* check can > > write inode */ +#if XFS_KEOF_FLAGS_PERM_CHECK & XFS_EOF_FLAGS_VALID > > +#error "Internal XFS_KEOF_FLAGS_PERM_CHECK duplicated bit from > > XFS_EOF_FLAGS_VALID" +#endif > > BUILD_BUG_ON() is the preferred method of doing this. Say, in > xfs_super.c::init_xfs_fs(). Sounds good, will do it there. > Cheers, > > Dave. _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs