On Wed, Jan 29, 2014 at 12:35:25PM +0100, Piotr Bartosiewicz wrote:
>
> On 28.01.2014 12:46, Daniel P. Berrange wrote:
> >On Tue, Jan 28, 2014 at 12:32:41PM +0100, Jan Olszak wrote:
> >>Hi there!
> >>
> >>I am trying to turn on user namespace by adding the following lines to
> >>the config:
> >>
> >>  <idmap>
> >>    <uid start='0' target='0' count='100000'/>
> >>    <gid start='0' target='0' count='100000'/>
> >>  </idmap>
> >>
> >>As you can see the root in container is mapped to the root outside. I
> >>expected to see no difference after adding these lines, but
> >>unfortunately there are some (see details below).
> >>
> >>Am I missing something or is there a problem with system, libvirt or
> >>kernel?
> >I've not had any chance to try LXC + user namespaces + systemd yet, but
> >based on the list of things which fail, it seems like it might not be
> >detecting that it is inside a container. Seems almost like it has still
> >got the CAP_MKNOD permission and so is trying to start things it should
> >not have like udev, and various filesystems.
> >
> >Daniel
>
> I was able to reduce the problem by not using libvirt nor systemd.
>
> I've created a bash process inside user namespace with mapping
> root_inside<->root_outside.
> I've used a program from https://lwn.net/Articles/532593/ :
> ./userns_child_exec -U -M '0 0 1' -G '0 0 1' bash
> This program simply calls clone with the CLONE_NEWUSER flag and sets
> the proper uid_map and gid_map.
>
> The test commands are as follows:
> mkdir /test
> mount debugfs /test -t debugfs
>
> and strace shows:
> mount("debugfs", "/test", "debugfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)
>
>
> Now the question is:
> Is it a kernel bug or expected behavior, i.e. inside user namespace we
> have always limited permissions even if uid=0 inside container is
> mapped to uid=0 outside?

uid==0 inside the container will not have exactly the same permissions
as uid==0 in the host.
The reason is the way the kernel checks capabilities. When a syscall
requires CAP_SYS_ADMIN, for example, the kernel uses either
capable(CAP_SYS_ADMIN), which only succeeds in the host (the initial
user namespace), or ns_capable(CAP_SYS_ADMIN), which is allowed to
succeed in the container. Different filesystems have differing
restrictions, but at this time the vast majority of filesystems require
that capable(CAP_SYS_ADMIN) succeed, and thus you can only mount them
in the host.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
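
To make the capable() vs ns_capable() distinction more concrete, here is
a rough toy model in Python. It is not kernel code and is only a sketch
under stated assumptions: the UserNS and Task classes and the may_mount()
helper are hypothetical names invented for illustration, and the model
ignores the rule that the owner of a child namespace gets capabilities
over it. The one part taken from the kernel is the idea that capable()
is the same check as ns_capable() made against the initial user
namespace.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


@dataclass(frozen=True, eq=False)
class UserNS:
    """Hypothetical stand-in for a user namespace.

    parent is None only for the initial (host) namespace.
    """
    name: str
    parent: Optional["UserNS"] = None


INIT_USER_NS = UserNS("init_user_ns")


@dataclass(frozen=True)
class Task:
    """Hypothetical process: its namespace and the caps it holds there."""
    user_ns: UserNS
    caps: FrozenSet[str]


def ns_capable(task: Task, target_ns: UserNS, cap: str) -> bool:
    """Simplified check: does task hold cap over target_ns?

    Walk from target_ns towards the initial namespace. If the task's own
    namespace is target_ns or one of its ancestors, the task's capability
    set decides. Otherwise the task has no power over target_ns.
    """
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.caps
        ns = ns.parent
    return False


def capable(task: Task, cap: str) -> bool:
    """capable() is the check against the initial user namespace."""
    return ns_capable(task, INIT_USER_NS, cap)


def may_mount(task: Task, fs_allows_userns_mount: bool) -> bool:
    """Hypothetical mount permission check (illustration only).

    Filesystems that opt in are checked with ns_capable() against the
    task's own namespace; all others require capable().
    """
    if fs_allows_userns_mount:
        return ns_capable(task, task.user_ns, CAP_SYS_ADMIN)
    return capable(task, CAP_SYS_ADMIN)


container_ns = UserNS("container", parent=INIT_USER_NS)
host_root = Task(INIT_USER_NS, frozenset({CAP_SYS_ADMIN}))
# uid 0 mapped to uid 0, but living in a child user namespace
container_root = Task(container_ns, frozenset({CAP_SYS_ADMIN}))

if __name__ == "__main__":
    for label, task in (("host root", host_root),
                        ("container root", container_root)):
        print(label,
              "capable:", capable(task, CAP_SYS_ADMIN),
              "ns_capable (own ns):",
              ns_capable(task, task.user_ns, CAP_SYS_ADMIN),
              "mount without opt-in:", may_mount(task, False))
```

In this model the container's root passes ns_capable() for its own
namespace but fails capable(), so a filesystem that has not opted in to
user namespace mounts is refused, much like the debugfs mount in the
strace output quoted above. The uid mapping does not enter into the
check at all, which is why mapping 0 to 0 makes no difference here.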