On Sat, Jun 06, 2015 at 01:56:54PM -0500, Eric W. Biederman wrote:
> Richard Weinberger <richard@xxxxxx> writes:
>
> > [CC'ing libvirt-lxc folks]
> >
> > On 28.05.2015 at 23:32, Eric W. Biederman wrote:
> >> Richard Weinberger <richard@xxxxxx> writes:
> >>
> >>> On 28.05.2015 at 21:57, Eric W. Biederman wrote:
> >>>>> FWIW, it also breaks libvirt-lxc:
> >>>>> Error: internal error: guest failed to start: Failed to re-mount /proc/sys on /proc/sys flags=1021: Operation not permitted
> >>>>
> >>>> Interesting.  I had not anticipated a failure there?  And it is failing
> >>>> in remount?  Oh that is interesting.
> >>>>
> >>>> That implies that there is some flag of the original mount of /proc that
> >>>> the remount of /proc/sys is clearing, and that previously
> >>>>
> >>>> The flags specified are currently rdonly,remount,bind so I expect there
> >>>> are some other flags on proc that libvirt-lxc is clearing by accident
> >>>> and we did not fail before because the kernel was not enforcing things.
> >>>
> >>> Please see:
> >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l933
> >>> lxcContainerMountBasicFS()
> >>>
> >>> and:
> >>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/lxc/lxc_container.c;h=9a9ae5c2aaf0f90ff472f24fda43c077b44998c7;hb=HEAD#l850
> >>> lxcBasicMounts
> >>>
> >>>> What are the mount flags in a working libvirt-lxc?
> >>>
> >>> See:
> >>> test1:~ # cat /proc/self/mountinfo
> >>> 149 147 0:56 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> >>> 150 149 0:56 /sys /proc/sys ro,nodev,relatime - proc proc rw
> >>
> >>> If you need more info, please let me know. :-)
> >>
> >> Oh interesting I had not realized libvirt-lxc had grown an unprivileged
> >> mode using user namespaces.
> >>
> >> This does appear to be a classic remount bug, where you are not
> >> preserving the permissions.
> >> It appears the fact that the code
> >> failed to enforce locked permissions on the fresh mount of proc
> >> was hiding this bug until now.
> >>
> >> I expect what you actually want is the code below:
> >>
> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> >> index 9a9ae5c2aaf0..f008a7484bfe 100644
> >> --- a/src/lxc/lxc_container.c
> >> +++ b/src/lxc/lxc_container.c
> >> @@ -850,7 +850,7 @@ typedef struct {
> >>
> >>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> >> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> >> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >>
> >> Or possibly just:
> >>
> >> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> >> index 9a9ae5c2aaf0..a60ccbd12bfc 100644
> >> --- a/src/lxc/lxc_container.c
> >> +++ b/src/lxc/lxc_container.c
> >> @@ -850,7 +850,7 @@ typedef struct {
> >>
> >>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
> >>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> >> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> >> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, true, false, false },
> >>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
> >>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
> >>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> >>
> >> As there is little point in making /proc/sys read-only in a
> >> user-namespace, as the permission checks are uid based and no-one should
> >> have the global uid 0 in your container.  That makes mounting /proc/sys
> >> read-only rather pointless.
> >
> > Eric, using the patch below I was able to spawn a user-namespace enabled container
> > using libvirt-lxc. :-)
> >
> > I had to:
> > 1. Disable the read-only mount of /proc/sys, which is useless anyway in the user-namespace case.
> > 2. Disable the /proc/sys/net/ipv{4,6} bind mounts; this ugly hack is only needed for the non user-namespace case.
> > 3. Remove MS_RDONLY from the sysfs mount (for the non user-namespace case we'd have to keep this, though).
> >
> > Daniel, I'd take this as a chance to disable all the MS_RDONLY games if user namespaces are configured.
> > With Eric's fixes they hurt us. And as I wrote many times before, if root within the user-namespace
> > is able to do nasty things in /sys and /proc that's a plain kernel bug which needs fixing. There is no
> > point in mounting these read-only, except for the case when no user-namespace is used.
> >
>
> For clarity the patch below appears to be the minimal change needed to
> fix this security issue.
>
> AKA add mnt_mflags in when remounting something read-only.
>
> /proc/sys needed to be updated so it had the proper flags to be added
> back in.
>
> I hope this helps.
>
> diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
> index 9a9ae5c2aaf0..11e9514e0761 100644
> --- a/src/lxc/lxc_container.c
> +++ b/src/lxc/lxc_container.c
> @@ -850,7 +850,7 @@ typedef struct {
>
>  static const virLXCBasicMountInfo lxcBasicMounts[] = {
>      { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
> -    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
> +    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
>      { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
>      { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
>      { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
> @@ -1030,7 +1030,7 @@ static int lxcContainerMountBasicFS(bool userns_enabled,
>
>          if (bindOverReadonly &&
>              mount(mnt_src, mnt->dst, NULL,
> -                  MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
> +                  MS_BIND|MS_REMOUNT|mnt_mflags|MS_RDONLY, NULL) < 0) {
>              virReportSystemError(errno,
>                                   _("Failed to re-mount %s on %s flags=%x"),
>                                   mnt_src, mnt->dst,

Thanks Richard / Eric for the suggested patches. I'll apply Eric's
simplified patch to libvirt now, and backport it to our stable
libvirt branches.

Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|