On Fri, Jun 28, 2013 at 11:37 AM, Tim Hockin <thockin@xxxxxxxxxx> wrote:
> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> > Quoting Tim Hockin (thockin@xxxxxxxxxx):
> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> >> > Quoting Tim Hockin (thockin@xxxxxxxxxx):
> >> >
> >> >> For our use case this is a huge problem.  We have people who access
> >> >> cgroup files in fairly tight loops, polling for information.  We
> >> >> have literally hundreds of jobs running at sub-second frequencies -
> >> >> plumbing all of that through a daemon is going to be a disaster.
> >> >> Either your daemon becomes a bottleneck, or we have to build
> >> >> something far more scalable than you really want to.  Not to mention
> >> >> the inefficiency of inserting a layer.
> >> >
> >> > Currently you can trivially create a container which has the
> >> > container's cgroups bind-mounted to the expected places
> >> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the
> >> > configuration file, and handle cgroups through cgroupfs there.
> >> > (This is what the management agent wants to be an alternative for.)
> >> > The main deficiency there is that /proc/self/cgroup is not filtered,
> >> > so it will show /lxc/c1 for init's cgroup, while the host's
> >> > /sys/fs/cgroup/devices/lxc/c1/c1.real will be what is seen under the
> >> > container's /sys/fs/cgroup/devices (for instance).  Not ideal.
> >>
> >> I'm really saying that if your daemon is to provide a replacement for
> >> cgroupfs direct access, it needs to be designed to be scalable.  If
> >> we're going to get away from bind-mounting cgroupfs into user
> >> namespaces, then let's try to solve ALL the problems.
> >>
> >> >> We also need the ability to set up eventfds for users or to let them
> >> >> poll() on the socket from this daemon.
> >> >
> >> > So you'd want to be able to request updates when any cgroup value
> >> > is changed, right?
> >>
> >> Not necessarily ANY, but that's the terminus of this API facet.
> >>
> >> > That's currently not in my very limited set of commands, but I can
> >> > certainly add it, and yes it would be a simple unix sock so you can
> >> > set up eventfd, select/poll, etc.
> >>
> >> Assuming the protocol is basically a pass-through to basic filesystem
> >> ops, it should be pretty easy.  You just need to add it to your
> >> protocol.
> >>
> >> But it brings up another point - access control.  How do you decide
> >> which files a child agent should have access to?  Does that ever
> >> change based on the child's configuration?  In our world, the answer
> >> is almost certainly yes.
> >
> > Could you give examples?
> >
> > If you have a white/academic paper I should go read, that'd be great.
>
> We don't have anything on this, but examples may help.
>
> Someone running as root should be able to connect to the "native"
> daemon and read or write any cgroup file they want, right?  You could
> argue that root should be able to do this to a child-daemon, too, but
> let's ignore that.
>
> But inside a container, I don't want the users to be able to write to
> anything in their own container.  I do want them to be able to make
> sub-cgroups, but only 5 levels deep.  For sub-cgroups, they should be
> able to write to memory.limit_in_bytes, to read but not write
> memory.soft_limit_in_bytes, and not be able to read memory.stat.
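As a purely illustrative sketch, the per-file rules and nesting limit just described amount to a small policy table.  Nothing below exists in any current manager; the type and function names are invented here only to make the requirement concrete (C):

/* Hypothetical sketch only: nothing below exists in any real cgroup
 * manager; the names are invented to make the requirements concrete. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum file_access { ACCESS_DENY, ACCESS_READ_ONLY, ACCESS_READ_WRITE };

struct file_rule {
    const char *file;            /* cgroup control file name           */
    enum file_access access;
};

struct container_policy {
    int max_depth;               /* how deep sub-cgroups may be nested  */
    bool may_create_subgroups;   /* e.g. production jobs yes, batch no  */
    const struct file_rule *rules;
    size_t nrules;
};

/* The example from the thread: limit writable, soft limit read-only,
 * memory.stat not even readable, nesting capped at 5 levels. */
static const struct file_rule example_rules[] = {
    { "memory.limit_in_bytes",      ACCESS_READ_WRITE },
    { "memory.soft_limit_in_bytes", ACCESS_READ_ONLY  },
    { "memory.stat",                ACCESS_DENY       },
};

static const struct container_policy example_policy = {
    .max_depth            = 5,
    .may_create_subgroups = true,
    .rules                = example_rules,
    .nrules               = sizeof(example_rules) / sizeof(example_rules[0]),
};

/* Default-deny lookup: anything not explicitly listed is refused. */
static bool may_write(const struct container_policy *p,
                      const char *file, int depth)
{
    if (depth > p->max_depth)
        return false;
    for (size_t i = 0; i < p->nrules; i++)
        if (strcmp(p->rules[i].file, file) == 0)
            return p->rules[i].access == ACCESS_READ_WRITE;
    return false;
}

int main(void)
{
    printf("write memory.limit_in_bytes at depth 2: %s\n",
           may_write(&example_policy, "memory.limit_in_bytes", 2) ? "yes" : "no");
    printf("write memory.stat at depth 2: %s\n",
           may_write(&example_policy, "memory.stat", 2) ? "yes" : "no");
    return 0;
}

The point of default-deny is that a control file the policy has never heard of - say, one added by a newer kernel - is not silently exposed to the container.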
>
> To get even fancier, a user should be able to create a sub-cgroup and
> then designate that sub-cgroup as "final" - no further sub-sub-cgroups
> allowed under it.  They should also be able to designate that a
> sub-cgroup is "one-way" - once a process enters it, it can not leave.
>
> These are real(ish) examples based on what people want to do today.
> In particular, the last couple are things that we want to do, but
> don't do today.

To elaborate on what Tim mentioned earlier: a lot of Google workloads
run third-party code (think AppEngine).  The need to create sub-cgroups
and move such third-party code into those cgroups to limit their
memory/cpu usage is very real.  Monitoring stats for such workloads via
polling the cgroup or via eventfds is imperative.

> The particular policy can differ per-container.  Production jobs might
> be allowed to create sub-cgroups, but batch jobs are not.  Some user
> jobs are designated "trusted" in one facet or another and get more
> (but still not full) access.
>
> > At the moment I'm going on the naive belief that proper hierarchy
> > controls will be enforced (eventually) by the kernel - i.e. if
> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
> > won't be possible for /lxc/c1/lxc/c2 to take that access.
> >
> > The native cgroup manager (the one using cgroupfs) will be checking
> > the credentials of the requesting child manager for access(2) to
> > the cgroup files.
>
> This might be sufficient, or the basis for a sufficient access-control
> system for users.  The problem comes when we have multiple jobs on a
> single machine running as the same user.  We need to ensure that the
> jobs can not modify each other.
>
> >> >> >> > So then the idea would be that userspace (like libvirt and lxc)
> >> >> >> > would talk over /dev/cgroup to its manager.  Userspace inside a
> >> >> >> > container (which can't actually mount cgroups itself) would talk
> >> >> >> > to its own manager, which is talking over a passed-in socket to
> >> >> >> > the host manager, which in turn runs natively (uses cgroupfs, and
> >> >> >> > nests "create /c1" under the requestor's cgroup).
> >> >> >>
> >> >> >> How do you handle updates of this agent?  Suppose I have hundreds
> >> >> >> of running containers, and I want to release a new version of the
> >> >> >> cgroupd?
> >> >> >
> >> >> > This may change (which is part of what I want to investigate with
> >> >> > some POC), but right now I'm not building any controller-aware
> >> >> > smarts into it.  I think that's what you're asking about?  The
> >> >> > agent doesn't do "slices" etc.  This may turn out to be
> >> >> > insufficient, we'll see.
> >> >>
> >> >> No, what I am asking is a release-engineering problem.  Suppose we
> >> >> need to roll out a new version of this daemon (some new feature or a
> >> >> bug fix or something).  We have hundreds of these "child" agents
> >> >> running in the job containers.
> >> >
> >> > When I say "container" I mean an lxc container, with its own isolated
> >> > rootfs and mntns.  I'm not sure what your "containers" are, but if
> >> > they're not that, then they shouldn't need to run a child agent.  They
> >> > can just talk over the host cgroup agent's socket.
> >>
> >> If they talk over the host agent's socket, where is the access control
> >> and restriction done?  Who decides how deep I can nest groups?  Who
> >> says which files I may access?  Who stops me from modifying someone
> >> else's container?
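For reference, the eventfd-based monitoring mentioned above is something cgroupfs already exposes for the v1 memory controller through cgroup.event_control: a client registers an eventfd against a control file such as memory.usage_in_bytes and then blocks (or polls) on that eventfd.  A minimal sketch, assuming the memory controller is mounted at /sys/fs/cgroup/memory and that a cgroup named "mygroup" exists (both are illustrative, not defaults of any manager):

/* Minimal sketch of the cgroup-v1 eventfd notification mechanism.
 * Paths assume the v1 memory controller at /sys/fs/cgroup/memory and
 * an existing cgroup "mygroup" -- both illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *base = "/sys/fs/cgroup/memory/mygroup";
    char buf[256];
    int efd, ufd, cfd;
    uint64_t count;

    efd = eventfd(0, 0);                      /* fd the kernel will signal */
    snprintf(buf, sizeof(buf), "%s/memory.usage_in_bytes", base);
    ufd = open(buf, O_RDONLY);                /* file being watched        */
    snprintf(buf, sizeof(buf), "%s/cgroup.event_control", base);
    cfd = open(buf, O_WRONLY);                /* registration interface    */
    if (efd < 0 || ufd < 0 || cfd < 0) {
        perror("eventfd/open");
        return 1;
    }

    /* Registration format: "<eventfd> <fd being watched> <threshold>". */
    snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 512ULL << 20);
    if (write(cfd, buf, strlen(buf)) < 0) {
        perror("write cgroup.event_control");
        return 1;
    }

    /* Blocks until memory usage crosses the 512 MB threshold. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("threshold crossed, count=%llu\n",
               (unsigned long long)count);
    return 0;
}

Because the registered descriptor is an ordinary eventfd, it also works with select/poll/epoll, which is what would let a manager daemon forward such notifications over its unix socket.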
> >>
> >> Our containers are somewhat thinner and more managed than LXC, but not
> >> that much.  If we're running a system agent in a user container, we
> >> need to manage that software.  We can't just start up a version and
> >> leave it running until the user decides to upgrade - we force
> >> upgrades.
> >>
> >> >> How do I bring down all these children, and then bring them back up
> >> >> on a new version in a way that does not disrupt user jobs (much)?
> >> >>
> >> >> Similarly, what happens when one of these child agents crashes?  Does
> >> >> someone restart it?  Do user jobs just stop working?
> >> >
> >> > An upstart^W$init_system job will restart it...
> >>
> >> What happens when the main agent crashes?  All those children on UNIX
> >> sockets need to reconnect, I guess.  This means your UNIX socket needs
> >> to be a named socket, not just a socketpair(), making your auth model
> >> more complicated.
> >
> > It is a named socket.
>
> So anyone can connect?  even with SO_PEERCRED, how do you know which
> branches of the cgroup tree I am allowed to modify if the same user
> owns more than one?
>
> >> What happens when the main agent hangs?  Is someone health-checking
> >> it?  How about all the child daemons?
> >>
> >> I guess my main point is that this SOUNDS like a simple project, but
> >
> > I guess it's not "simple".  It just focuses on one specific problem.
> >
> >> if you just do the simple obvious things, it will be woefully
> >> inadequate for anything but simple use-cases.  If we get forced into
> >> such a model (and there are some good reasons to do it, even
> >> disregarding all the other chatter), we'd rather use the same thing
> >> that the upstream world uses, and not re-invent the whole thing
> >> ourselves.
> >>
> >> Do you have a design spec, or a requirements list, or even a prototype
> >> that we can look at?
> >
> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
> > shows what I have in mind.  It (and the sloppy code next to it)
> > represent a few hours' work over the last few days while waiting
> > for compiles and in between emails...
>
> Awesome.  Do you mind if we look?
>
> > But again, it is completely predicated on my goal to have libvirt
> > and lxc (and other cgroup users) be able to use the same library
> > or API to make their requests whether they are on host or in a
> > container, and regardless of the distro they're running under.
>
> I think that is a good goal.  We'd like to not be different, if
> possible.  Obviously, we can't impose our needs on you if you don't
> want to handle them.  It sounds like what you are building is the
> bottom layer in a stack - we (Google) should use that same bottom
> layer.  But that can only happen iff you're open to hearing our
> requirements.  Otherwise we have to strike out on our own or build
> more layers in-between.
>
> Tim
>
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers
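A footnote on the SO_PEERCRED exchange above: a named AF_UNIX socket does let the manager ask the kernel for the connecting peer's pid/uid/gid, but, as noted in the thread, that only answers "who is connecting", not "which branches of the cgroup tree they may touch".  A minimal sketch (the socket path is an assumption for illustration, not an actual default of any manager):

/* Sketch only: retrieve the peer's credentials on a named AF_UNIX
 * socket.  The socket path below is an illustrative assumption. */
#define _GNU_SOURCE             /* for struct ucred */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MGR_SOCK "/run/cgroup-mgr.sock"

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    struct ucred cred;
    socklen_t len = sizeof(cred);
    int srv, conn;

    strncpy(addr.sun_path, MGR_SOCK, sizeof(addr.sun_path) - 1);

    srv = socket(AF_UNIX, SOCK_STREAM, 0);
    unlink(MGR_SOCK);                       /* drop a stale socket file  */
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 5) < 0) {
        perror("socket/bind/listen");
        return 1;
    }

    conn = accept(srv, NULL, NULL);
    if (conn < 0) {
        perror("accept");
        return 1;
    }

    /* Kernel-verified pid/uid/gid of whoever connected.  Mapping that
     * identity to an allowed cgroup subtree is still the manager's own
     * policy problem, as the thread points out. */
    if (getsockopt(conn, SOL_SOCKET, SO_PEERCRED, &cred, &len) == 0)
        printf("peer pid=%d uid=%d gid=%d\n",
               (int)cred.pid, (int)cred.uid, (int)cred.gid);

    close(conn);
    close(srv);
    return 0;
}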