On Fri, Jun 28, 2013 at 11:37 AM, Tim Hockin <thockin@xxxxxxxxxx> wrote:
> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> > Quoting Tim Hockin (thockin@xxxxxxxxxx):
> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <serge.hallyn@xxxxxxxxxx> wrote:
> >> > Quoting Tim Hockin (thockin@xxxxxxxxxx):
> >> >
> >> >> For our use case this is a huge problem.  We have people who access
> >> >> cgroup files in fairly tight loops, polling for information.  We
> >> >> have literally hundreds of jobs running at sub-second frequencies -
> >> >> plumbing all of that through a daemon is going to be a disaster.
> >> >> Either your daemon becomes a bottleneck, or we have to build
> >> >> something far more scalable than you really want to.  Not to mention
> >> >> the inefficiency of inserting a layer.
> >> >
> >> > Currently you can trivially create a container which has the
> >> > container's cgroups bind-mounted to the expected places
> >> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the
> >> > configuration file, and handle cgroups through cgroupfs there.
> >> > (This is what the management agent wants to be an alternative for.)
> >> > The main deficiency there is that /proc/self/cgroup is not filtered,
> >> > so it will show /lxc/c1 for init's cgroup, while the host's
> >> > /sys/fs/cgroup/devices/lxc/c1/c1.real will be what is seen under the
> >> > container's /sys/fs/cgroup/devices (for instance).  Not ideal.
> >>
> >> I'm really saying that if your daemon is to provide a replacement for
> >> cgroupfs direct access, it needs to be designed to be scalable.  If
> >> we're going to get away from bind-mounting cgroupfs into user
> >> namespaces, then let's try to solve ALL the problems.
> >>
> >> >> We also need the ability to set up eventfds for users or to let them
> >> >> poll() on the socket from this daemon.
> >> >
> >> > So you'd want to be able to request updates when any cgroup value
> >> > is changed, right?
> >>
> >> Not necessarily ANY, but that's the terminus of this API facet.
> >>
> >> > That's currently not in my very limited set of commands, but I can
> >> > certainly add it, and yes it would be a simple unix sock so you can
> >> > set up eventfd, select/poll, etc.
> >>
> >> Assuming the protocol is basically a pass-through to basic filesystem
> >> ops, it should be pretty easy.  You just need to add it to your
> >> protocol.
> >>
> >> But it brings up another point - access control.  How do you decide
> >> which files a child agent should have access to?  Does that ever
> >> change based on the child's configuration?  In our world, the answer
> >> is almost certainly yes.
> >
> > Could you give examples?
> >
> > If you have a white/academic paper I should go read, that'd be great.
>
> We don't have anything on this, but examples may help.
>
> Someone running as root should be able to connect to the "native"
> daemon and read or write any cgroup file they want, right?  You could
> argue that root should be able to do this to a child-daemon, too, but
> let's ignore that.
>
> But inside a container, I don't want the users to be able to write to
> anything in their own container.  I do want them to be able to make
> sub-cgroups, but only 5 levels deep.  For sub-cgroups, they should be
> able to write to memory.limit_in_bytes, to read but not write
> memory.soft_limit_in_bytes, and not be able to read memory.stat.
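As a purely illustrative sketch, the per-file rules and nesting limit just described amount to a small policy table.  Nothing below exists in any current manager; the type and function names are invented here only to make the requirement concrete (C):

/* Hypothetical sketch only: nothing below exists in any real cgroup
 * manager; the names are invented to make the requirements concrete. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum file_access { ACCESS_DENY, ACCESS_READ_ONLY, ACCESS_READ_WRITE };

struct file_rule {
    const char *file;            /* cgroup control file name           */
    enum file_access access;
};

struct container_policy {
    int max_depth;               /* how deep sub-cgroups may be nested  */
    bool may_create_subgroups;   /* e.g. production jobs yes, batch no  */
    const struct file_rule *rules;
    size_t nrules;
};

/* The example from the thread: limit writable, soft limit read-only,
 * memory.stat not even readable, nesting capped at 5 levels. */
static const struct file_rule example_rules[] = {
    { "memory.limit_in_bytes",      ACCESS_READ_WRITE },
    { "memory.soft_limit_in_bytes", ACCESS_READ_ONLY  },
    { "memory.stat",                ACCESS_DENY       },
};

static const struct container_policy example_policy = {
    .max_depth            = 5,
    .may_create_subgroups = true,
    .rules                = example_rules,
    .nrules               = sizeof(example_rules) / sizeof(example_rules[0]),
};

/* Default-deny lookup: anything not explicitly listed is refused. */
static bool may_write(const struct container_policy *p,
                      const char *file, int depth)
{
    if (depth > p->max_depth)
        return false;
    for (size_t i = 0; i < p->nrules; i++)
        if (strcmp(p->rules[i].file, file) == 0)
            return p->rules[i].access == ACCESS_READ_WRITE;
    return false;
}

int main(void)
{
    printf("write memory.limit_in_bytes at depth 2: %s\n",
           may_write(&example_policy, "memory.limit_in_bytes", 2) ? "yes" : "no");
    printf("write memory.stat at depth 2: %s\n",
           may_write(&example_policy, "memory.stat", 2) ? "yes" : "no");
    return 0;
}

The point of default-deny is that a control file the policy has never heard of - say, one added by a newer kernel - is not silently exposed to the container.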
>
> To get even fancier, a user should be able to create a sub-cgroup and
> then designate that sub-cgroup as "final" - no further sub-sub-cgroups
> allowed under it.  They should also be able to designate that a
> sub-cgroup is "one-way" - once a process enters it, it can not leave.
>
> These are real(ish) examples based on what people want to do today.
> In particular, the last couple are things that we want to do, but
> don't do today.

To elaborate on what Tim mentioned earlier: a lot of Google workloads
run third-party code (think AppEngine).  The need to create sub-cgroups
and move such third-party code into those cgroups to limit their
memory/cpu usage is very real.  Monitoring stats for such workloads via
polling the cgroup or via eventfds is imperative.

> The particular policy can differ per-container.  Production jobs might
> be allowed to create sub-cgroups, but batch jobs are not.  Some user
> jobs are designated "trusted" in one facet or another and get more
> (but still not full) access.
>
> > At the moment I'm going on the naive belief that proper hierarchy
> > controls will be enforced (eventually) by the kernel - i.e. if
> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
> > won't be possible for /lxc/c1/lxc/c2 to take that access.
> >
> > The native cgroup manager (the one using cgroupfs) will be checking
> > the credentials of the requesting child manager for access(2) to
> > the cgroup files.
>
> This might be sufficient, or the basis for a sufficient access-control
> system for users.  The problem comes when we have multiple jobs on a
> single machine running as the same user.  We need to ensure that the
> jobs can not modify each other.
>
> >> >> >> > So then the idea would be that userspace (like libvirt and lxc)
> >> >> >> > would talk over /dev/cgroup to its manager.  Userspace inside a
> >> >> >> > container (which can't actually mount cgroups itself) would talk
> >> >> >> > to its own manager, which is talking over a passed-in socket to
> >> >> >> > the host manager, which in turn runs natively (uses cgroupfs, and
> >> >> >> > nests "create /c1" under the requestor's cgroup).
> >> >> >>
> >> >> >> How do you handle updates of this agent?  Suppose I have hundreds
> >> >> >> of running containers, and I want to release a new version of the
> >> >> >> cgroupd?
> >> >> >
> >> >> > This may change (which is part of what I want to investigate with
> >> >> > some POC), but right now I'm not building any controller-aware
> >> >> > smarts into it.  I think that's what you're asking about?  The
> >> >> > agent doesn't do "slices" etc.  This may turn out to be
> >> >> > insufficient, we'll see.
> >> >>
> >> >> No, what I am asking is a release-engineering problem.  Suppose we
> >> >> need to roll out a new version of this daemon (some new feature or a
> >> >> bug fix or something).  We have hundreds of these "child" agents
> >> >> running in the job containers.
> >> >
> >> > When I say "container" I mean an lxc container, with its own isolated
> >> > rootfs and mntns.  I'm not sure what your "containers" are, but if
> >> > they're not that, then they shouldn't need to run a child agent.  They
> >> > can just talk over the host cgroup agent's socket.
> >>
> >> If they talk over the host agent's socket, where is the access control
> >> and restriction done?  Who decides how deep I can nest groups?  Who
> >> says which files I may access?  Who stops me from modifying someone
> >> else's container?
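For reference, the eventfd-based monitoring mentioned above is something cgroupfs already exposes for the v1 memory controller through cgroup.event_control: a client registers an eventfd against a control file such as memory.usage_in_bytes and then blocks (or polls) on that eventfd.  A minimal sketch, assuming the memory controller is mounted at /sys/fs/cgroup/memory and that a cgroup named "mygroup" exists (both are illustrative, not defaults of any manager):

/* Minimal sketch of the cgroup-v1 eventfd notification mechanism.
 * Paths assume the v1 memory controller at /sys/fs/cgroup/memory and
 * an existing cgroup "mygroup" -- both illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *base = "/sys/fs/cgroup/memory/mygroup";
    char buf[256];
    int efd, ufd, cfd;
    uint64_t count;

    efd = eventfd(0, 0);                      /* fd the kernel will signal */
    snprintf(buf, sizeof(buf), "%s/memory.usage_in_bytes", base);
    ufd = open(buf, O_RDONLY);                /* file being watched        */
    snprintf(buf, sizeof(buf), "%s/cgroup.event_control", base);
    cfd = open(buf, O_WRONLY);                /* registration interface    */
    if (efd < 0 || ufd < 0 || cfd < 0) {
        perror("eventfd/open");
        return 1;
    }

    /* Registration format: "<eventfd> <fd being watched> <threshold>". */
    snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, 512ULL << 20);
    if (write(cfd, buf, strlen(buf)) < 0) {
        perror("write cgroup.event_control");
        return 1;
    }

    /* Blocks until memory usage crosses the 512 MB threshold. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("threshold crossed, count=%llu\n",
               (unsigned long long)count);
    return 0;
}

Because the registered descriptor is an ordinary eventfd, it also works with select/poll/epoll, which is what would let a manager daemon forward such notifications over its unix socket.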
> >>
> >> Our containers are somewhat thinner and more managed than LXC, but not
> >> that much.  If we're running a system agent in a user container, we
> >> need to manage that software.  We can't just start up a version and
> >> leave it running until the user decides to upgrade - we force
> >> upgrades.
> >>
> >> >> How do I bring down all these children, and then bring them back up
> >> >> on a new version in a way that does not disrupt user jobs (much)?
> >> >>
> >> >> Similarly, what happens when one of these child agents crashes?  Does
> >> >> someone restart it?  Do user jobs just stop working?
> >> >
> >> > An upstart^W$init_system job will restart it...
> >>
> >> What happens when the main agent crashes?  All those children on UNIX
> >> sockets need to reconnect, I guess.  This means your UNIX socket needs
> >> to be a named socket, not just a socketpair(), making your auth model
> >> more complicated.
> >
> > It is a named socket.
>
> So anyone can connect?  even with SO_PEERCRED, how do you know which
> branches of the cgroup tree I am allowed to modify if the same user
> owns more than one?
>
> >> What happens when the main agent hangs?  Is someone health-checking
> >> it?  How about all the child daemons?
> >>
> >> I guess my main point is that this SOUNDS like a simple project, but
> >
> > I guess it's not "simple".  It just focuses on one specific problem.
> >
> >> if you just do the simple obvious things, it will be woefully
> >> inadequate for anything but simple use-cases.  If we get forced into
> >> such a model (and there are some good reasons to do it, even
> >> disregarding all the other chatter), we'd rather use the same thing
> >> that the upstream world uses, and not re-invent the whole thing
> >> ourselves.
> >>
> >> Do you have a design spec, or a requirements list, or even a prototype
> >> that we can look at?
> >
> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
> > shows what I have in mind.  It (and the sloppy code next to it)
> > represent a few hours' work over the last few days while waiting
> > for compiles and in between emails...
>
> Awesome.  Do you mind if we look?
>
> > But again, it is completely predicated on my goal to have libvirt
> > and lxc (and other cgroup users) be able to use the same library
> > or API to make their requests whether they are on host or in a
> > container, and regardless of the distro they're running under.
>
> I think that is a good goal.  We'd like to not be different, if
> possible.  Obviously, we can't impose our needs on you if you don't
> want to handle them.  It sounds like what you are building is the
> bottom layer in a stack - we (Google) should use that same bottom
> layer.  But that can only happen iff you're open to hearing our
> requirements.  Otherwise we have to strike out on our own or build
> more layers in-between.
>
> Tim
>
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers
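A footnote on the SO_PEERCRED exchange above: a named AF_UNIX socket does let the manager ask the kernel for the connecting peer's pid/uid/gid, but, as noted in the thread, that only answers "who is connecting", not "which branches of the cgroup tree they may touch".  A minimal sketch (the socket path is an assumption for illustration, not an actual default of any manager):

/* Sketch only: retrieve the peer's credentials on a named AF_UNIX
 * socket.  The socket path below is an illustrative assumption. */
#define _GNU_SOURCE             /* for struct ucred */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MGR_SOCK "/run/cgroup-mgr.sock"

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    struct ucred cred;
    socklen_t len = sizeof(cred);
    int srv, conn;

    strncpy(addr.sun_path, MGR_SOCK, sizeof(addr.sun_path) - 1);

    srv = socket(AF_UNIX, SOCK_STREAM, 0);
    unlink(MGR_SOCK);                       /* drop a stale socket file  */
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 5) < 0) {
        perror("socket/bind/listen");
        return 1;
    }

    conn = accept(srv, NULL, NULL);
    if (conn < 0) {
        perror("accept");
        return 1;
    }

    /* Kernel-verified pid/uid/gid of whoever connected.  Mapping that
     * identity to an allowed cgroup subtree is still the manager's own
     * policy problem, as the thread points out. */
    if (getsockopt(conn, SOL_SOCKET, SO_PEERCRED, &cred, &len) == 0)
        printf("peer pid=%d uid=%d gid=%d\n",
               (int)cred.pid, (int)cred.uid, (int)cred.gid);

    close(conn);
    close(srv);
    return 0;
}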