Re: How we use cgroups in rkt

Quoting Alban Crequy (alban@xxxxxxxxxxxx):
> On Wed, Jun 17, 2015 at 10:30 PM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> > Quoting Iago López Galeiras (iago@xxxxxxxxxxxx):
> >> Hi everyone,
> >>
> >> We are working on rkt[1] and we want to ask for feedback about the way we use
> >> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally,
> >> so I guess the best way to start is to explain how this is handled in
> >> systemd-nspawn.
> >>
> >> The approach taken by nspawn is to mount the cgroup controllers read-only
> >> inside the container, except for the container's own subtree of the systemd
> >> controller. It is done this way because allowing the container to modify the
> >> other controllers is considered unsafe[2].
> >>
> >> This is what the bind mounts look like:
> >>
> >> /sys/fs/cgroup/devices RO
> >> [...]
> >> /sys/fs/cgroup/memory RO
> >> /sys/fs/cgroup/systemd RO
> >> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
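> >>
> >> A rough sketch of how such a layout could be set up with mount(8) (purely
> >> illustrative, not nspawn's actual code; $ROOT stands for the container
> >> root):
> >>
> >> for ctrl in devices memory systemd; do
> >>     mount --bind "/sys/fs/cgroup/$ctrl" "$ROOT/sys/fs/cgroup/$ctrl"
> >>     mount -o remount,bind,ro "$ROOT/sys/fs/cgroup/$ctrl"
> >> done
> >> # keep the container's own scope in the systemd hierarchy writable
> >> mount --bind /sys/fs/cgroup/systemd/machine.slice/machine-a.scope \
> >>     "$ROOT/sys/fs/cgroup/systemd/machine.slice/machine-a.scope"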
> >>
> >> In rkt we have a concept called pod[3] which is a list of apps that run inside a
> >> container, each running in its own chroot. To implement this concept, we start a
> >> systemd-nspawn container with a minimal systemd installation that starts each
> >> app as a service.
> >>
> >> We want to be able to apply different restrictions to each app of a pod using
> >> cgroups, and the most straightforward way we could think of was to delegate to
> >> systemd inside the container. Initially, this didn't work because, as mentioned
> >> earlier, the cgroup controllers are mounted read-only.
> >>
> >> The way we solved this problem was to mount the cgroup hierarchy (with the
> >> directories expected by systemd) outside the container. The difference from
> >> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
> >> we leave the knobs we need in each application’s subcgroup read-write.
> >>
> >> For example, if we want to restrict the memory usage of an application, we leave
> >> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
> >> read-write.
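> >>
> >> A sketch of the idea (purely illustrative, not rkt’s actual code): mount
> >> the controller read-only, then bind the needed knob files from the
> >> writable host view back over themselves:
> >>
> >> APP=/sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx
> >> for knob in memory.limit_in_bytes cgroup.procs; do
> >>     # a file-over-file bind mount from the writable view stays writable
> >>     # even though the surrounding hierarchy is mounted read-only
> >>     mount --bind "$APP/$knob" "$ROOT$APP/$knob"
> >> done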
> >
> > Who exactly does the writing to those files?
> 
> First, rkt prepares a systemd .service file for each application in
> the container with "CPUQuota=" and "MemoryLimit=". The .service files
> are not used by systemd outside the container. Then, rkt uses
> systemd-nspawn to start systemd as pid 1 in the container. Finally,
> systemd inside the container writes to the cgroup files
> {memory.limit_in_bytes,cgroup.procs}.
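> 
> For illustration, such a generated unit might look roughly like this
> (unit, app, and binary names made up):
> 
> # sha512-xxxx.service (hypothetical example)
> [Unit]
> Description=Application myapp
> 
> [Service]
> ExecStart=/opt/myapp/bin/myapp
> MemoryLimit=512M
> CPUQuota=50%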
> 
> We call those limits the "per-app isolators". It's not a security
> boundary because all the apps run in the same container (in the same
> pid/mount/net namespaces). The apps run in different chroots, but
> that's easily escapable.
> 
> > Do the applications want to change them, or only rkt itself?
> 
> At the moment, the limits are statically defined in the app container
> image, so neither rkt nor the apps inside the container change them. I
> don't know of a use case where we would need to change them
> dynamically.
> 
> >  If rkt, then it seems like you should be
> > able to use a systemd api to update the values (over dbus), right?
> > systemctl set-property machine-a.scope MemoryLimit=1G or something.
> 
> In addition to the "per-app isolators" described above, rkt can have
> "pod-level isolators" that are applied on the machine slice (the
> cgroup parent directory) rather than at the leaves of the cgroup tree.
> They are defined when rkt itself is started by a systemd .service
> file, and applied by systemd outside of the container. E.g.
> 
> [Service]
> CPUShares=512
> MemoryLimit=1G
> ExecStart=/usr/bin/rkt run myapp.com/myapp-1.3.4
> 
> Updating the pod-level isolators with systemctl on the host should work.
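> 
> On the host, that would be something like (unit name illustrative; it
> could equally be the .service that runs rkt):
> 
> systemctl set-property --runtime machine-rkt-xxxxx.scope MemoryLimit=2G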
> 
> But neither systemd inside the container nor the apps have access to
> the required cgroup knob files: they are mounted read-only.
> 
> > Now I'm pretty sure that systemd doesn't yet support being able to do
> > this from inside the container in a delegated way.
> 
> Indeed, by default nspawn/systemd does not support delegating that. It
> only works because rkt prepared the cgroup bind mounts for the
> container.
> 
> > That was cgmanager's
> > reason for being, and I'm interested in working on a proper API for that
> > for systemd.
> 
> Do you mean patching systemd so it does not write to the cgroup
> filesystem directly but talks to the cgmanager/cgproxy socket instead?

More likely, patch it so it can talk to a systemd-owned unix socket
bind-mounted into the container.  So systemd would need to be patched
at both ends.  But that's something I was hoping would happen upstream
anyway.  I would be very happy to help out in that effort.
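
Very roughly, the host-side plumbing could look like this (socket path
entirely hypothetical; the real API would have to be designed upstream):

# host side: expose a systemd-owned delegation socket in the container
touch "$ROOT/run/systemd/delegate.sock"
mount --bind /run/systemd/delegate.sock "$ROOT/run/systemd/delegate.sock"

Systemd inside the container would then send cgroup change requests over
that socket instead of writing to /sys/fs/cgroup itself.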

-serge