It is possible in some situations to rename a file or directory through
one mount point such that it starts out inside of a bind mount and,
after the rename, winds up outside of the bind mount. Unfortunately,
with user namespaces these conditions can be trivially created by
creating a bind mount under an existing bind mount.

I have identified four situations in which this may be a problem.

- __d_path and d_absolute_path need to error on disconnected paths that
  can not reach some root directory, or lsm path based security checks
  can incorrectly succeed.

- Normal path name resolution following .. can lead to a directory that
  is outside of the original loopback mount.

- File handle reconstitution, aka exportfs_decode_fh, can yield a dentry
  from which d_parent can be followed up to mnt->sb->s_root, but
  d_parent can not be followed up to mnt->mnt_root.

- Mounts on a path that has been renamed outside of a loopback mount
  become unreachable, as there is no possible path that can be passed
  to umount to unmount them.

My strategy:

o File handle reconstitution problems can be prevented by enabling the
  nfsd subtree checks for nfs exports, and open_by_handle_at requires
  capable(CAP_DAC_READ_SEARCH) so is only usable by the global root.
  This makes any problems difficult if not impossible to exploit in
  practice, so I have not yet written code to address that issue.

o The functions __d_path and d_absolute_path are augmented so that the
  security modules will not be fed a problematic path to work with.

o Following of .. has been augmented to test that, after d_parent has
  been resolved, the original directory is connected, and if not an
  error of -ENOENT is returned.

o I do not worry about mounts that are disconnected from their bind
  mount, as these mounts can always be freed by either umount -l on the
  bind mount they have escaped from, or by freeing the mount namespace.
  So I do not believe there is an actual problem.

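To make the .. check in the strategy above concrete, here is a small
user-space model of it. This is only a sketch and not the code from the
patches: struct toy_dentry, struct toy_mount, toy_is_connected and
toy_follow_dotdot are hypothetical names invented for illustration, and
the model ignores locking, RCU, and crossing into a parent mount.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dentry: only the parent link matters here. */
struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself at the fs root */
	const char *name;
};

/* Toy stand-in for a mount: the dentry the bind mount exposes as "/". */
struct toy_mount {
	struct toy_dentry *mnt_root;
};

/* True if dir is mnt_root or can reach mnt_root by following parents. */
static bool toy_is_connected(const struct toy_mount *mnt,
			     const struct toy_dentry *dir)
{
	for (;;) {
		if (dir == mnt->mnt_root)
			return true;
		if (dir->parent == dir)	/* hit the fs root first */
			return false;
		dir = dir->parent;
	}
}

/*
 * Model of following "..": resolve the parent, then verify that the
 * directory we started from is still connected to mnt_root. A directory
 * that was renamed out of the bind mount yields -ENOENT.
 */
static int toy_follow_dotdot(const struct toy_mount *mnt,
			     struct toy_dentry *dir,
			     struct toy_dentry **result)
{
	struct toy_dentry *parent;

	if (dir == mnt->mnt_root) {	/* ".." at the mount root stays put */
		*result = dir;
		return 0;
	}
	parent = dir->parent;
	if (!toy_is_connected(mnt, dir))
		return -ENOENT;
	*result = parent;
	return 0;
}
```

With a bind mount rooted at /a/b, following .. from /a/b/d yields /a/b
in this model. Once d has been renamed to /a/d, the same call returns
-ENOENT instead of handing back /a, which lies outside the bind mount.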
Name resolution is a common fast path, and most of the code in this
patchset is there to keep following .. from becoming quadratic, as far
as is humanly possible.

For the implementation I went back to the drawing board and carefully
read through the affected code, so I could be certain I knew what was
going on, and this wound up with some very significant implementation
changes from a correctness point of view.

On each mount I keep an escape count, which is almost but not quite a
seqcount, that is bumped each time a directory escapes a mount point.
This allows marking the mounts that do have directories escape and
allows caching of when a path has been verified to have no escapes, so
in the common case even a mount that has had a directory escape will
see only a single call to d_ancestor during path name resolution, the
first time .. is encountered.

I have not benchmarked the code, but I don't see any reason to expect
that anything except rename will see a performance impact, and then
only in cases where a rename potentially allows a directory to escape
lots of mounts.

Do I have something that is good enough this time, or am I blind and
missing something?

These changes are all against v4.2-rc4.

For those who like to see everything in a single tree the code is at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

Eric W. Biederman (6):
      mnt: Track which mounts use a dentry as root.
      dcache: Handle escaped paths in prepend_path
      dcache: Implement d_common_ancestor
      mnt: Track when a directory escapes a bind mount
      vfs: Test for and handle paths that are unreachable from their mnt_root
      vfs: Cache the results of path_connected

 fs/dcache.c            |  90 ++++++++++++++++--
 fs/mount.h             |  25 +++++
 fs/namei.c             |  59 +++++++++++-
 fs/namespace.c         | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |   8 ++
 5 files changed, 409 insertions(+), 16 deletions(-)
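
For readers who want a picture of the escape count caching described in
this cover letter, the following user-space model may help. It is a
sketch under stated assumptions, not the implementation in the patches:
every ec_* name is hypothetical, the ancestor walk stands in for the
d_ancestor call, and there is no locking or seqcount-style retry logic.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ec_dentry {
	struct ec_dentry *parent;	/* points to itself at the fs root */
};

struct ec_mount {
	struct ec_dentry *mnt_root;
	unsigned escape_count;	/* bumped when a directory escapes */
};

/* Per-lookup state (hypothetical): remembers a successful verification. */
struct ec_walk {
	const struct ec_mount *mnt;
	unsigned verified_count;	/* escape_count seen when verified */
	bool verified;
	unsigned ancestor_walks;	/* instrumentation for the example */
};

/* Stand-in for the ancestor test: is dir at or below root? */
static bool ec_is_ancestor(const struct ec_dentry *root,
			   const struct ec_dentry *dir)
{
	for (;;) {
		if (dir == root)
			return true;
		if (dir->parent == dir)
			return false;
		dir = dir->parent;
	}
}

/* Record that some directory escaped this mount. */
static void ec_note_escape(struct ec_mount *mnt)
{
	mnt->escape_count++;
}

/* Follow ".." once, doing the ancestor walk only when the cache is stale. */
static int ec_follow_dotdot(struct ec_walk *w, struct ec_dentry **dir)
{
	const struct ec_mount *mnt = w->mnt;
	struct ec_dentry *parent;

	if (*dir == mnt->mnt_root)	/* ".." at the mount root stays put */
		return 0;
	parent = (*dir)->parent;

	/* A mount that never had an escape needs no check at all. */
	if (mnt->escape_count != 0 &&
	    !(w->verified && w->verified_count == mnt->escape_count)) {
		w->ancestor_walks++;
		if (!ec_is_ancestor(mnt->mnt_root, *dir))
			return -ENOENT;
		w->verified = true;
		w->verified_count = mnt->escape_count;
	}
	*dir = parent;
	return 0;
}
```

In this model a lookup on a mount whose escape count is zero never
performs the ancestor walk, and a lookup on a mount that has seen an
escape performs it once, on the first .., as long as the count does not
change during the walk.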