On 16/03/2021 20:24, Kees Cook wrote:
> On Tue, Mar 16, 2021 at 08:04:09PM +0100, Jann Horn wrote:
>> On Tue, Mar 16, 2021 at 6:02 PM Mickaël Salaün <mic@xxxxxxxxxxx> wrote:
>>> One could argue that chroot(2) is useless without a properly populated
>>> root hierarchy (i.e. without /dev and /proc). However, there are
>>> multiple use cases that don't require the chrooting process to create
>>> file hierarchies with special files nor mount points, e.g.:
>>> * A process sandboxing itself, once all its libraries are loaded, may
>>> not need files other than regular files, or even no file at all.
>>> * Some pre-populated root hierarchies could be used to chroot into,
>>> provided for instance by development environments or tailored
>>> distributions.
>>> * Processes executed in a chroot may not require access to these special
>>> files (e.g. with minimal runtimes, or by emulating some special files
>>> with a LD_PRELOADed library or seccomp).
>>>
>>> Unprivileged chroot is especially interesting for userspace developers
>>> wishing to harden their applications. For instance, chroot(2) and Yama
>>> enable to build a capability-based security (i.e. remove filesystem
>>> ambient accesses) by calling chroot/chdir with an empty directory and
>>> accessing data through dedicated file descriptors obtained with
>>> openat2(2) and RESOLVE_BENEATH/RESOLVE_IN_ROOT/RESOLVE_NO_MAGICLINKS.
>>
>> I don't entirely understand. Are you writing this with the assumption
>> that a future change will make it possible to set these RESOLVE flags
>> process-wide, or something like that?
>
> I thought it meant "open all out-of-chroot dirs as fds using RESOLVE_...
> flags then chroot". As in, there's no way to then escape "up" for the
> old opens, and the new opens stay in the chroot.

Yes, that was the idea.

>
>> [...]
>>> diff --git a/fs/open.c b/fs/open.c
>> [...]
>>> +static inline int current_chroot_allowed(void)
>>> +{
>>> +	/*
>>> +	 * Changing the root directory for the calling task (and its future
>>> +	 * children) requires that this task has CAP_SYS_CHROOT in its
>>> +	 * namespace, or be running with no_new_privs and not sharing its
>>> +	 * fs_struct and not escaping its current root (cf. create_user_ns()).
>>> +	 * As for seccomp, checking no_new_privs avoids scenarios where
>>> +	 * unprivileged tasks can affect the behavior of privileged children.
>>> +	 */
>>> +	if (task_no_new_privs(current) && current->fs->users == 1 &&
>>
>> this read of current->fs->users should be using READ_ONCE()
>
> Ah yeah, good call. I should remember this when I think "can this race?"
> :P
>
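
For illustration, the "open the directories as fds, then chroot" pattern
discussed above could look roughly like the userspace sketch below. It is
only a sketch under assumptions: open_beneath() and sandbox_into() are
hypothetical helper names, not part of the patch, and the chroot(2) call
still needs CAP_SYS_CHROOT on a kernel without this series. The openat2(2)
call goes through syscall(2) so that it does not depend on a libc wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_* (Linux >= 5.6) */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` read-only relative to `dirfd`, refusing
 * any resolution that would leave the tree under `dirfd` ("..", absolute
 * paths or symlinks) and refusing magic links.
 * Returns a new fd, or -errno on failure.
 */
int open_beneath(int dirfd, const char *path)
{
	struct open_how how;
	long fd;

	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_BENEATH | RESOLVE_NO_MAGICLINKS;

	/* Raw syscall: does not assume that libc provides openat2(). */
	fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
	return fd < 0 ? -errno : (int)fd;
}

/*
 * Hypothetical helper: grab the directory fd needed later, then drop
 * ambient filesystem access by chrooting into an (assumed empty)
 * directory. Returns the data directory fd, or -errno on failure.
 * Assumption: the caller is allowed to chroot, i.e. it has CAP_SYS_CHROOT,
 * or it runs with no_new_privs on a kernel carrying the proposed change.
 */
int sandbox_into(const char *data_dir, const char *empty_dir)
{
	int data_fd = open(data_dir, O_PATH | O_DIRECTORY | O_CLOEXEC);

	if (data_fd < 0)
		return -errno;
	if (chroot(empty_dir) < 0 || chdir("/") < 0) {
		int err = errno;

		close(data_fd);
		return -err;
	}
	return data_fd;
}
```

After sandbox_into() succeeds, path-based opens only see the empty root,
while open_beneath(data_fd, "some/file") still reaches the pre-opened tree
and cannot walk "up" out of it (an escaping path is expected to fail with
EXDEV).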