It is possible in some situations to rename a file or directory through one mount point such that it can start out inside of a bind mount and after the rename wind up outside of the bind mount. Unfortunately with user namespaces these conditions can be trivially created by creating a bind mount under an existing bind mount. I have identified four situations in which this may be a problem. - __d_path and d_absolute_path need to error on disconnected paths that can not reach some root directory or lsm path based security checks can incorrectly succeed. - Normal path name resolution following .. can lead to a directory that is outside of the original loopback mount. - file handle reconsititution aka exportfs_decode_fh can yield a dentry from which d_parent can be followed up to mnt->sb->s_root, but d_parent can not be followed up to mnt->mnt_root. - Mounts on a path that has been renamed outside of a loopback mount become unreachable, as there is no possible path that can be passed to umount to unmount them. My strategy: o File handle reconsitituion problems can be prevented by enabling the nfsd subtree checks for nfs exports, and open_by_handle_at requires capable(CAP_DAC_READ_SEARCH) so is only usable by the global root. This makes any problems difficult if not impossible to exploit in practice so I have not yet written code to address that issue. o The functions __d_path and d_absolute_path are agumented so that the security modules will not be fed a problematic path to work with. o Following of .. has been agumented to test that after d_parent has been resolved the original directory is connected, and if not an error of -ENOENT is returned. o I do not worry about mounts that are disconnected from their bind mount as these mounts can always be freed by either umount -l on the bind mount they have escaped from, or by freeing the mount namespace. So I do not believe there is an actual problem. Pathname resolution is a common fast path and most of the code in this patchset to support keeping .. from becoming expensive in the common case. After hearing the Al's feedback and running some numbers I have given up attempting to keeping the number of d_ancestor calls during pathname resolution to an absolute minimum. It appears that simply preventing calls d_ancestor unless a directory has escaped is good enough. This change in approach has significantly simplified the code. The big implementation change to note is that I have rewritten d_splice_alias and made some significant progress in cleaning up how the locks are dealt with. The only limitation now is that dentry->d_parent->d_inode->i_mutex is taken in lookup held when d_splice_alias is called. If that ever goes away my new d_splice_alias can easily take all of the locks it needs to rename a directory in the proper order. Does anyone see anything significant that I have missed? These changes are all against v4.2-rc1. For those who like to see everything in a single tree the code is at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing Eric W. Biederman (8): dcache: Handle escaped paths in prepend_path dcache: Reduce the scope of i_lock in d_splice_alias dcache: Clearly separate the two directory rename cases in d_splice_alias mnt: Track which mounts use a dentry as root. dcache: Implement d_common_ancestor dcache: Only read d_flags once is d_is_dir mnt: Track when a directory escapes a bind mount vfs: Test for and handle paths that are unreachable from their mnt_root fs/dcache.c | 193 +++++++++++++++++++++++++++++++++++++------------ fs/mount.h | 9 +++ fs/namei.c | 26 ++++++- fs/namespace.c | 152 +++++++++++++++++++++++++++++++++++++- include/linux/dcache.h | 11 ++- include/linux/mount.h | 1 + 6 files changed, 338 insertions(+), 54 deletions(-) _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers