On Tue, May 2, 2017 at 6:33 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Tue, May 2, 2017 at 7:29 AM, Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
>> On Thu, Apr 27, 2017 at 12:13 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>> On Tue, Apr 25, 2017 at 5:23 AM, Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
>> [...]
>>>> We have to align procfs and modernize it to have a per mount context
>>>> where at least the mount options do not propagate to all other mounts,
>>>> then maybe we can continue to implement new features. One example is to
>>>> require CAP_SYS_ADMIN in the init user namespace on some /proc/* which are
>>>> not pids and which are not virtualized by design, or CAP_NET_ADMIN
>>>> inside userns on the net bits that are virtualized, etc.
>>>> These mount options won't propagate to previous mounts, and the system
>>>> will continue to be usable.
>>>>
>>>> This patch introduces the new 'limit_pids' mount option as it was also
>>>> suggested by Andy Lutomirski [1]. When this option is passed we
>>>> automatically create a private procfs instance. This is not the default
>>>> behaviour since we do not want to break userspace and we do not want to
>>>> provide different device IDs by default, please see [1] for why.
>>>
>>> I think that calling the option to make a separate instance
>>> "limit_pids" is extremely counterintuitive.
>>
>> Ok.
>>
>>> My strong preference would be to make proc *always* make a separate
>>> instance (unless it's a bind mount) and to make it work. If that
>>> means fudging stat() output, so be it.
>>
>> I also agree, but as said if we change stat(), userspace won't be able
>> to notice if these two proc instances are really separated, the device
>> ID is the only indication here.
>
> I re-read all the threads and I'm still not convinced I see why we
> need new_instance to be non-default.
> It's true that the device
> numbers of /proc/ns/* matter, but if you look (with stat -L, for
> example), they're *already* not tied to the procfs instance.

Hmm, indeed. The namespace FDs point internally to the internal proc
mount that is created during pidns initialization, which means the
NS_GET_PARENT ioctl won't change, which is good. Only things that rely
on stat()ing other inodes may notice.

>
> I'm okay with adding new_instance to be on the safe side, but I'd like
> it to be done in a way that we could make it become the default some
> day without breaking anything. This means that we need to be rather
> careful about how new_instance and hidepid interact.

Sounds good. From the devpts history it seems that "newinstance" was
used to absorb new changes/updates easily, and it was made a no-op only
recently, with commit eedf265aa003b4 "devpts: Make each mount of devpts
an independent filesystem." last year. It was initially introduced by
commit 2a1b2dc0c83bbfc24 "Enable multiple instances of devpts" in 2009.

Starting from this:

1) "hidepid" works with the "gid" membership option, which is sticky. I
   would like to avoid this combination, plus
2) "hidepid" currently changes the pid namespace options.

With "newinstance" set:

* "hidepid", instead of changing the pid namespace options, will only
  affect the new procfs instance.

* Changing the "hidepid" value during a remount of a *private* procfs
  instance will only affect that procfs instance and not the pid
  namespace or the other shared procfs mounts.

* "pids=ptraceable" makes /proc/ show only pids that the caller can
  ptrace. Together with NO_NEW_PRIVS set, it makes a good privacy
  measure. "pids=ptraceable" is also for *LSM*, so we guarantee that
  there is a ptrace security hook there for LSMs and that there are no
  relations or exceptions between "pids=ptraceable" and the "hidepid" /
  "gid" mount options. This will benefit Yama LSM later.

* "pids=ptraceable" will take precedence over "hidepid".

I assume defaulting to new instances later should continue to work.
Comments?

Thanks!

--
tixxdz