On 08.01.2015 at 14:45, Daniel P. Berrange wrote:
> On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
>> On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
>>> We have historically done a number of things with LXC that are
>>> somewhat questionable in retrospect
>>>
>>> 1. Mounted /proc/sys read-only, but then mounted
>>>    /proc/sys/net/ipv* read-write again
>>> 2. Mounted /sys read only
>>> 3. Mount /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
>>> 4. FUSE mount on /proc/meminfo
>>>
>>> Items 1 & 2 are pointless as they offer no security benefit either
>>> with or without user namespaces. Without userns it is always insecure,
>>> with userns it is always secure, no matter what the mount state is.
>>
>> I agree. Thanks a lot for addressing this, Daniel!
>>
>>> Item 3 is somewhat dubious, since /proc/self/cgroup paths for
>>> processes are now not visible at /sys/fs/cgroup. This really
>>> confuses systemd inside the container making it create a broken
>>> layout
>>
>> The question is, how to support systemd in containers?
>>
>> As of now I'm not aware of a working concept.
>> With current libvirt it kind of works but recently I found a very nasty issue:
>> See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
>
> That reply from Lennart suggests systemd should pretty much work,
> albeit in a hacky way.

What hack do you mean? *confused*

> I've not done much in anger with systemd in containers, but I have
> found it sufficient for application containers - ie not full OS
> containers with interactive sessions.

My use case is different. Most of the time I need at least an init.
And if the distro is systemd based....

>>
>> Maybe with cgroup namespaces it works. i.e. such that systemd can mount cgroupfs
>> within the container in a secure way.
>> The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150
>>
>> As of now I have to drop all my systemd lxc guests and will replace them by
>> a non-systemd distro, which is very sad. :-(
>>
>>> Item 4 is somewhat dubious, since we're only changing some of the
>>> fields in /proc/meminfo. It helps apps which blindly parse
>>> /proc/meminfo to determine free system resources they can consume.
>>> Those apps are broken even without containers being involved though,
>>> since any application must expect to be placed inside a cgroup with
>>> limited resources. Faking /proc/meminfo is a pretty limited workaround
>>> that just delays the inevitable fixing of such apps.
>>
>> You mean that tools like free(1) have to be patched to query also
>> memory limits from cgroupfs?
>
> Not necessarily. The 'free' tool is said to
>
>   "Display amount of free and used memory in the system"
>
> so it is arguably correct that it reports /proc/meminfo of the host
> as a whole.
>
> What is broken are applications that are invoking 'free' and then
> believing that the values it reports correspond to what the
> application is able to use. ie the applications are not taking
> into account that they might not have the ability to use the entire
> system resources due to cgroups or containers or both.
>
>>> The patch that follows just removes the items 1 & 2, but I'm thinking
>>> we should go further and remove items 3 & 4 too.
>>>
>>> Changing 4 in particular is certainly classed as a guest ABI
>>> change though, so is not something distros may wish to see when
>>> upgrading libvirt. There is scope to argue that 1-3 are guest ABI
>>> changes too
>>>
>>> In full machine virt world, we deal with this using machine types.
>>> eg each new KVM version introduces a new machine type which models
>>> the guest ABI in a stable fashion. Guest machine types are fixed at
>>> time of first deployment.
>>> So when libvirt / KVM is upgraded, existing
>>> guests will not see any changes, but new guests will automatically
>>> get the new machine type.
>>>
>>> I'm thinking we might want to make use of this in LXC before making
>>> these changes. eg introduce a new machine 'libvirt-lxc-1' to
>>> represent the current guest mount setup and make sure all existing
>>> guests get that machine type. Then introduce a new machine type
>>> libvirt-lxc-2 that removes all this cruft, which new guests will
>>> get by default.
>>>
>>> Alternatively we could call them 'libvirt-lxc-compat-1' and
>>> 'libvirt-lxc-bare-1' to give a clearer indication of their
>>> functional difference and version them separately in the future ?
>>
>> Can we have a new machine type which enforces user namespaces?
>
> Hmm, I'm not sure that would work. Not least because we need a way to
> assume the UID/GID mapping, and the filesystems used with the container
> need to have the right UID/GID permissions setup. IOW I don't think
> user ns is something we can transparently / automatically turn on.

Yeah but we have to warn the user that she is doing something insecure
if no mappings are set up.

Thanks,
//richard
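
To illustrate the point about item 4 (applications that trust
/proc/meminfo without considering cgroup limits), here is a minimal
Python sketch of what a "fixed" application might do: take the smaller
of MemTotal and the cgroup memory limit. This is not taken from the
patch or from libvirt. The function names are hypothetical, and the
path /sys/fs/cgroup/memory/memory.limit_in_bytes is an assumption about
where the cgroup v1 memory controller is visible to the process; it
may differ or be absent depending on how the container is set up.

```python
from typing import Optional

# Assumed location of the cgroup v1 memory limit as seen by the process.
CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def parse_meminfo_total(text: str) -> int:
    """Return MemTotal in bytes from /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like "MemTotal:       16384256 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unlimited."""
    value = text.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    return int(value)


def usable_memory(meminfo_text: str, limit_text: Optional[str]) -> int:
    """Memory the application should assume it may use, in bytes."""
    total = parse_meminfo_total(meminfo_text)
    limit = parse_cgroup_limit(limit_text) if limit_text is not None else None
    # An unlimited v1 cgroup reports a huge number, so min() covers that too.
    return total if limit is None else min(total, limit)


def read_text_or_none(path: str) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    meminfo = read_text_or_none("/proc/meminfo")
    if meminfo is not None:
        print(usable_memory(meminfo, read_text_or_none(CGROUP_V1_LIMIT)))
```

With logic along these lines an application gets a sane answer whether
or not /proc/meminfo is faked, which is the argument for dropping the
FUSE mount in the first place.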