Quoting Alban Crequy (alban@xxxxxxxxxxxx):
> On Wed, Jun 17, 2015 at 10:30 PM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> > Quoting Iago López Galeiras (iago@xxxxxxxxxxxx):
> >> Hi everyone,
> >>
> >> We are working on rkt[1] and we want to ask for feedback about the way
> >> we use cgroups to implement isolation in containers. rkt uses
> >> systemd-nspawn internally, so I guess the best way to start is by
> >> explaining how this is handled in systemd-nspawn.
> >>
> >> The approach taken by nspawn is to mount the cgroup controllers
> >> read-only inside the container, except for the container's own part of
> >> the systemd controller. It is done this way because allowing the
> >> container to modify the other controllers is considered unsafe[2].
> >>
> >> This is what the bind mounts look like:
> >>
> >> /sys/fs/cgroup/devices RO
> >> [...]
> >> /sys/fs/cgroup/memory RO
> >> /sys/fs/cgroup/systemd RO
> >> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
> >>
> >> In rkt we have a concept called a pod[3], which is a list of apps that
> >> run inside a container, each running in its own chroot. To implement
> >> this concept, we start a systemd-nspawn container with a minimal
> >> systemd installation that starts each app as a service.
> >>
> >> We want to be able to apply different restrictions to each app of a pod
> >> using cgroups, and the straightforward way we thought of was delegating
> >> to systemd inside the container. Initially, this didn't work because,
> >> as mentioned earlier, the cgroup controllers are mounted read-only.
> >>
> >> The way we solved this problem was to mount the cgroup hierarchy (with
> >> the directories expected by systemd) outside the container. The
> >> difference from systemd-nspawn's approach is that we don't mount
> >> everything read-only; instead, we leave the knobs we need in each of
> >> the application's subcgroups read-write.
> >>
> >> For example, if we want to restrict the memory usage of an application
> >> we leave
> >> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
> >> read-write.
> >
> > Who exactly does the writing to those files?
>
> First, rkt prepares a systemd .service file for each application in
> the container with "CPUQuota=" and "MemoryLimit=". The .service files
> are not used by systemd outside the container. Then, rkt uses
> systemd-nspawn to start systemd as pid 1 in the container. Finally,
> systemd inside the container writes to the cgroup files
> {memory.limit_in_bytes,cgroup.procs}.
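
(Just to check that I'm picturing the per-app side correctly: I assume
each generated unit inside the container looks roughly like the
following - the app name, path and values here are made up -

  [Service]
  ExecStart=/opt/myapp/bin/myapp
  MemoryLimit=512M
  CPUQuota=50%

i.e. ordinary resource directives that the container's systemd then
turns into the cgroup writes you describe.)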

> We call those limits the "per-app isolators". This is not a security
> boundary because all the apps run in the same container (in the same
> pid/mount/net namespaces). The apps run in different chroots, but
> that's easily escapable.
>
> > Do the applications want to change them, or only rkt itself?
>
> At the moment, the limits are statically defined in the app container
> image, so neither rkt nor the apps inside the container change them. I
> don't know of a use case where we would need to change them
> dynamically.
>
> > If rkt, then it seems like you should be
> > able to use a systemd API to update the values (over D-Bus), right?
> > systemctl set-property machine-a.scope MemoryLimit=1G or something.
>
> In addition to the "per-app isolators" described above, rkt can have
> "pod-level isolators" that are applied on the machine slice (the
> cgroup parent directory) rather than at the leaves of the cgroup tree.
> They are defined when rkt itself is started by a systemd .service
> file, and applied by systemd outside of the container. E.g.:
>
> [Service]
> CPUShares=512
> MemoryLimit=1G
> ExecStart=/usr/bin/rkt run myapp.com/myapp-1.3.4
>
> Updating the pod-level isolators with systemctl on the host should work.
>
> But neither systemd inside the container nor the apps have access to the
> required cgroup knob files: they are mounted read-only.
>
> > Now I'm pretty sure that systemd doesn't yet support being able to do
> > this from inside the container in a delegated way.
>
> Indeed, by default nspawn/systemd does not support delegating that. It
> only works because rkt prepared the cgroup bind mounts for the
> container.
>
> > That was cgmanager's
> > reason for being, and I'm interested in working on a proper API for that
> > for systemd.
>
> Do you mean patching systemd so it does not write to the cgroup
> filesystem directly but talks to the cgmanager/cgproxy socket instead?

More likely, patch it so it can talk to a systemd-owned unix socket
bind-mounted into the container. So systemd would need to be patched at
both ends. But that's something I was hoping would happen upstream
anyway. I would be very happy to help out in that effort.

-serge
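
P.S. To make that a bit more concrete, roughly the shape I have in mind
(the socket path and the request format below are made up, and neither
end exists yet - that's exactly the part that would need patching on
both sides):

  # host side: a socket owned by the host's systemd is bind-mounted
  # into the container when it is started
  systemd-nspawn -D /var/lib/machines/mypod -b \
      --bind=/run/systemd/cgroup-delegate.sock

  # container side: instead of writing to /sys/fs/cgroup/... directly,
  # the patched systemd would send a request such as
  #   set memory.limit_in_bytes=536870912 on system.slice/myapp.service
  # over that socket, and the host end would apply it only within the
  # container's delegated subtree.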