Hello, guys.

Here's the write-up I promised last week about what I think are the
problems in cgroup and what the current plans are.

First of all, it's a mess. Shame on me. Shame on you. Shame on all of
us for allowing this mess. Let's all tremble in shame for a solid ten
seconds before proceeding.

I'll list the issues I currently see with cgroup (easier ones first).
I think I now have at least tentative plans for all of them and will
list them together with the prospective assignees (my wish mostly).
Unfortunately, some of the plans involve userland visible changes
which would at least cause some discomfort and require adjustments on
userland's part.

1. cpu and cpuacct

  They cover the same resources and the scheduler cgroup code ends up
  having to traverse two separate cgroup trees to update the stats.
  With nested cgroups, the overhead isn't insignificant and it
  generally is silly.

  While the use cases for having cpuacct on a separate and likely more
  granular hierarchy are somewhat valid, the consensus seems to be
  that it's just not worth the trouble: cpuacct should be removed in
  the long term, and we shouldn't allow overlapping controllers for
  the same resource, especially accounting ones.

  Solution:

  * Whine if cpuacct is not co-mounted with cpu.

  * Make sure cpu has all the stats of cpuacct. If cpu and cpuacct are
    co-mounted, don't really mount cpuacct but tell cpu that the user
    requested it. cpu is updated to create aliases for cpuacct.* files
    in such cases. This involves special-casing cpuacct in cgroup core
    but I much prefer a one-off exception case to adding a generic
    mechanism for this.

  * After a while, we can just remove cpuacct completely.

  * Later on, phase out the aliases too.

  Who:

  Me, working on it.

2. memcg's __DEPRECATED_clear_css_refs

  This is a remnant of another weird design decision of requiring
  synchronous draining of refcnts on cgroup removal and allowing
  subsystems to veto cgroup removal - what's the userspace supposed to
  do afterwards?
  Note that this also hinders co-mounting different controllers.

  The behavior could be useful for development and debugging but it
  unnecessarily interlocks userland visible behavior with in-kernel
  implementation details. To me, it seems outright wrong (either
  implement proper severing semantics in the controller or do full
  refcnting) and it disallows, for example, lazy drain of caching
  refs. Also, it complicates the removal path with try / commit /
  revert logic which has never been fully correct since the beginning.

  Currently, the only remaining user is memcg.

  Solution:

  * Update memcg->pre_destroy() such that it never fails.

  * Drop __DEPRECATED_clear_css_refs and all related logic. Convert
    pre_destroy() to return void.

  Who:

  KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs
  trigger WARN sooner or later. Let's please get this settled.

3. cgroup_mutex usage outside cgroup core

  This is another thing which is simply broken. Given the way cgroup
  is structured and used, nesting cgroup_mutex inside any other
  commonly used lock simply doesn't work - it's held while invoking
  controller callbacks which then interact and synchronize with
  various core subsystems.

  There are currently three external cgroup_mutex users - cpuset,
  memcontrol and cgroup_freezer.

  Solution:

  Well, we should just stop doing it - use a separate nested lock
  (which seems possible for cgroup_freezer) or track and manage task
  in/egress some other way.

  Who:

  I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's
  familiar with the code base takes care of cpuset. Michal, can you
  please take care of memcg?

4. Make disabled controllers cheaper

  Mostly through the use of static_keys, I suppose. Making this easier
  AFAICS depends on resolving #2. The lock dependency loop from #2
  makes using static_keys from cgroup callbacks extremely nasty.

  Solution:

  Fix #2 and support the common pattern from cgroup core.

  Who:

  Dunno. Let's see.

5. I CAN HAZ HIERARCHIES?

  The cpu ones handle nesting correctly - parent's accounting includes
  children's, parent's configuration affects children's unless
  explicitly overridden, and children's limits nest inside parent's.

  memcg asked itself the existential question of to be hierarchical or
  not and then got confused and decided to become both.

  When faced with the same question, blkio and cgroup_freezer just
  gave up and decided to allow nesting and then ignore it - brilliant.

  And there are others which kinda sorta try to handle hierarchy but
  only go halfway.

  This one is screwed up embarrassingly badly. We failed to establish
  one of the most basic semantics and can't even define what a cgroup
  hierarchy is - it depends on each controller and they're mostly
  wacky!

  Fortunately, I don't think it will be prohibitively difficult to dig
  ourselves out of this hole.

  Solution:

  * cpu ones seem fine.

  * For broken controllers, cgroup core will be generating warning
    messages if the user tries to nest cgroups so that the user at
    least can know that the behavior may change underneath them later
    on. For more details, see

      http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902

  * memcg can be fully hierarchical but we need to phase out the flat
    hierarchy support. Unfortunately, this involves flipping the
    behavior for the existing users. Upstream will try to nudge users
    with warning messages. Most of the burden would be on the distros
    and at least SUSE seems to be on board with it. Needs coordination
    with other distros.

  * blkio is the most problematic. It has two sub-controllers - cfq
    and blk-throttle. Both are utterly broken in terms of hierarchy
    support and the former is known to have a pretty hairy code base.
    I don't see any other way than just biting the bullet and fixing
    it.

  * cgroup_freezer and others shouldn't be too difficult to fix.

  Who:

  memcg can be handled by memcg people and I can handle cgroup_freezer
  and others with help from the authors. The problematic one is blkio.
  If anyone is interested in working on blkio, please be my guest.
  Vivek? Glauber?

6. Multiple hierarchies

  Apart from the apparent wheeeeeeeeness of it (I think I talked about
  that enough the last time[1]), there's a basic problem when multiple
  controllers interact - it's impossible to define a resource group
  when more than two controllers are involved because the intersection
  of different controllers is only defined in terms of tasks.

  IOW, if an entity X is of interest to two controllers, there's no
  way to map X to the cgroups of the two controllers. X may belong to
  A and B when viewed by one task but A' and B when viewed by another.
  This already is a head scratcher in writeback where blkcg and memcg
  have to interact.

  While I am pushing for a unified hierarchy, I think it's necessary
  to have different levels of granularity depending on controllers,
  given that nesting involves significant overhead and noticeable
  controller-dependent behavior changes.

  Solution:

  I think a unified hierarchy with the ability to ignore subtrees
  depending on controllers should work. For example, let's assume the
  following hierarchy.

          R
        /   \
       A     B
      / \
    AA   AB

  All controllers are co-mounted. There is a per-cgroup knob which
  controls which controllers nest beyond it. If blkio doesn't want to
  distinguish AA and AB, the user can specify that blkio doesn't nest
  beyond A and blkio would see the tree as,

          R
        /   \
       A     B

  while other controllers keep seeing the original tree. The exact
  form of the interface, I don't know yet. It could be a single file
  into which the user echoes [-]controller name, or a per-controller
  boolean file.

  I think this level of flexibility should be enough for most use
  cases. If someone disagrees, please voice your objections now.

  I *think* this can be achieved by changing where css_set is bound.
  Currently, a css_set is (conceptually) owned by a task.
  After the change, a cgroup in the unified hierarchy has its own
  css_set which tasks point to and which can also be used to tag
  resources as necessary. This way, it should be achievable without
  introducing a lot of new code or affecting individual controllers
  too much.

  The headache will be the transition period where we'll probably have
  to support both modes of operation. Oh well....

  Who:

  Li, Glauber and me, I guess?

7. Misc issues

  * Sort & unique when listing tasks. Even the documentation says it
    doesn't happen but we have a good hunk of code doing it in
    cgroup.c. I'm gonna rip it out at some point. Again, if you don't
    like it, scream.

  * At the PLC, pjt told me that assigning threads of a process to
    different cgroups is useful for some use cases but if we're to
    have a unified hierarchy, I don't think we can continue to do
    that. Paul, can you please elaborate on the use case?

  * Vivek brought up the issue of distributing resources to tasks and
    groups in the same cgroup. I don't know. Need to think more about
    it.

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/857
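
P.S. To make the "ignore subtrees" idea from item 6 a bit more concrete, here is a rough sketch of the lookup it implies. This is a toy userspace model in Python, not kernel code and not a proposed interface: the Cgroup class, the no_nest set and effective() are all made-up names for illustration. The only thing it tries to show is that each controller resolves a cgroup to the topmost ancestor (or the cgroup itself) at which that controller was told not to nest any further.

```python
class Cgroup:
    """Toy model of a node in a unified hierarchy (hypothetical names)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # Controllers which must not nest beyond this cgroup. This stands
        # in for the per-cgroup knob; its real form is undecided.
        self.no_nest = set()

    def effective(self, controller):
        """Return the cgroup that `controller` sees for this cgroup."""
        eff = self
        node = self
        # Walk up to the root; the topmost cgroup that stops nesting for
        # this controller wins, and everything below it collapses into it.
        while node is not None:
            if controller in node.no_nest:
                eff = node
            node = node.parent
        return eff


# The example tree from item 6.
R = Cgroup("R")
A = Cgroup("A", R)
B = Cgroup("B", R)
AA = Cgroup("AA", A)
AB = Cgroup("AB", A)

# blkio doesn't want to distinguish AA and AB.
A.no_nest.add("blkio")

blkio_view = {cg.name: cg.effective("blkio").name for cg in (R, A, B, AA, AB)}
memory_view = {cg.name: cg.effective("memory").name for cg in (R, A, B, AA, AB)}
```

With the knob set on A, blkio_view maps both AA and AB to A, while memory_view keeps every cgroup mapped to itself, which matches the two trees drawn in item 6.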