Quoting Andy Lutomirski (luto@xxxxxxxxxxxxxx):
> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <rob@xxxxxxxxxxx> wrote:
> > On 10/08/14 14:31, Andy Lutomirski wrote:
> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> >> <ebiederm@xxxxxxxxxxxx> wrote:
> >>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
> >>>>> Maybe we want to say that rootfs should not be used if we are going to
> >>>>> create containers...
> >>>
> >>> Today it is an assumption of the vfs that rootfs is mounted. With
> >>> rootfs mounted and pivot_root at the base of the mount stack you can
> >>> make as minimal of a set of mounts as the vfs allows.
> >>>
> >>> Removing rootfs from the vfs requires an audit of everything that
> >>> manipulates mounts. It is not remotely a local exercise.
> >>
> >> Would it be a less invasive audit to allow different mount namespaces
> >> to have different rootfses?
> >
> > I.E. The same way different namespaces have different init tasks?
> >
> > The abstraction containers has implemented here should be logically
> > consistent.
> >
> >>>> Could we have an extra rootfs-like fs that is always completely empty,
> >>>> doesn't allow any writes, and can sit at the bottom of container
> >>>> namespace hierarchies? If so, and if we add a new syscall that's like
> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> >>>> to that rootfs then.
> >>>
> >>> Or equally have something that guarantees that rootfs is empty and
> >>> read-only at the time the normal root filesystem is mounted. That is
> >>> certainly a much more localized change if we want to go there.
> >>>
> >>> I am half tempted to suggest that mount --move /some/path / be updated
> >>> to make the old / just go away (perhaps to be replaced with a read-only
> >>> empty rootfs). That gets us into figuring out if we break userspace
> >>> which is a big challenge.
> >>
> >> Hence my argument for a new syscall or entirely new operation.
> >
> > I'm still waiting for somebody to explain to me why chroot() shouldn't
> > be changed to do this instead of adding a new syscall. (At least when
> > mount namespace support is enabled.)
>
> Because chroot has no effect on the namespace at all. If you fork and
> the child chroots, the parent isn't chrooted. And, more importantly
> for my example, if a process has its cwd as /foo, and then it forks
> and the child chroots, then parent's ".." isn't changed as a result of
> the chroot.
>
> >
> >> mount(2) and friends are way too multiplexed right now. I just found
> >> yet another security bug due to the insanely complicated semantics of
> >> the vfs syscalls. (Yes, a different one from the one yesterday.)
> >
> > As the guy who rewrote busybox mount 3 times, and who just implemented a
> > brand new one (toybox) from scratch:
> >
> > It's a bit fiddly, yes.
> >
> >> A new operation kills several birds with one stone. It could look like:
> >>
> >> int mntns_change_root(int dfd, const char *path, int flags);
> >>
> >> return -EPERM if chrooted.
> >
> > Really?
>
> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
> to do if the caller is chrooted? The current behavior is obviously
> incorrect (it leaks memory), but it's not entirely clear to me what
> should happen. I think it should either be disallowed or should have
> well-defined semantics.
>
> For simplicity, if a new syscall for this is added, then I think that
> the caller-is-chrooted case should be disallowed. If someone needs it
> and can articulate what the semantics should be, then I have no
> problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some reason
I started out chrooted due to some bogus initramfs, I'd prefer to not
have to feel like a criminal and escape the chroot first.
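
For reference, the quoted point that chroot is per-process can be
checked from userspace. What follows is only a minimal sketch, not
taken from any patch in this thread: the parent records what "/" and
".." resolve to, forks a child that attempts chroot(), and then
compares again. The names parent_view_unchanged() and same_file() are
made up for illustration. chroot() needs CAP_SYS_CHROOT, so an
unprivileged child just fails with EPERM; the parent's view should be
unchanged either way.

```c
#define _DEFAULT_SOURCE   /* exposes chroot() in <unistd.h> on glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: two stat results name the same file if both
 * device and inode numbers match. */
static int same_file(const struct stat *a, const struct stat *b)
{
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/*
 * Hypothetical demo function. Forks a child that tries
 * chroot(new_root), then reports whether the parent's "/" and ".."
 * still resolve to the same files as before the fork.
 *
 * Returns 1 if unchanged, 0 if changed, -1 on error.
 * If child_chrooted is non-NULL, it is set to 1 when the child's
 * chroot() succeeded and 0 otherwise (e.g. EPERM without privilege).
 */
int parent_view_unchanged(const char *new_root, int *child_chrooted)
{
    struct stat root_before, root_after, up_before, up_after;
    int status;
    pid_t pid;

    assert(new_root != NULL);

    if (stat("/", &root_before) != 0 || stat("..", &up_before) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: only this process's root is affected, if anything. */
        if (chroot(new_root) != 0)
            _exit(2);
        if (chdir("/") != 0)
            _exit(3);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (child_chrooted)
        *child_chrooted = WIFEXITED(status) && WEXITSTATUS(status) == 0;

    /* Parent: re-resolve the same paths after the child has exited. */
    if (stat("/", &root_after) != 0 || stat("..", &up_after) != 0)
        return -1;

    return same_file(&root_before, &root_after) &&
           same_file(&up_before, &up_after);
}
```

Calling parent_view_unchanged("/tmp", &ok) should return 1 whether or
not the child had the privilege to chroot, which is the reason given
above for why chroot() alone cannot stand in for an operation that
changes the root of the mount namespace.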