Hello,

This email is to restart the discussion around thread granularity support for the cgroup cpu controller, which got lost around the following message.

 http://thread.gmane.org/gmane.linux.kernel/2021959/focus=14454

While the previous discussion didn't reach a conclusion, it uncovered the points of disagreement. As the thread became too difficult to follow, let's summarize and revisit each technical point.

cgroup v1 started out thread-granular and later grew process-granular operations. Thread-granular operations have some issues, which will be partially discussed in this message, and cgroup v2 is process-granular. As a result, hierarchical resource distribution among the threads of the same process isn't covered by the cgroup v2 interface proper.

For some controllers, especially cpu, in-process hierarchical resource distribution is important. This message discusses the in-process support for the cpu controller - where it belongs, what it should look like and why.

cpuset can also benefit from thread granularity; however, the situation around cpuset is murkier, so let's stay away from it for now. cpuset's issues are more about how to deal with CPU availability in general than cgroup behavior.


1. Goal

The goal of thread granularity support for the cpu controller can be summarized as

  Hierarchically organize threads of a process and control CPU cycle distribution along the hierarchy.


2. Background and Stuff to Consider

2-1. In-process Hierarchy in v1

In the v1 interface, there is no distinction between system-wide and in-process cgroup organizations. Everything happens through the same cgroupfs and in-process organization is entangled with everything else. Either the cgroup manager is directly involved or the in-process sub-hierarchy is delegated to the process itself. While seemingly simple, the interlocking of the two different domains causes a number of issues.

The role of each thread in a process is information private to the process in the sense that there is no reliable way of finding it out from outside without the process itself explicitly making the information available. Consequently, if an external manager is involved in the management of in-process organization, each such process has to communicate with it.

It's one thing to make system management software depend on a userland facility; it's something completely different to make normal applications depend on an external userland manager for operations as intimate as thread management. It will make the feature a lot more cumbersome and less useful.

While sub-hierarchy delegation doesn't seem to create direct external dependencies, cgroupfs doesn't provide enough facilities for such delegations to work. For example, there is no way for a thread to access its own sub-hierarchy atomically. It has to read a couple of files to construct the path but may be moved to a different cgroup at any moment, making it access the wrong cgroup. Also, it isn't clear who is responsible for in-process organization. System management and normal applications still need to coordinate.

Both cases suffer from the kernel failing to provide proper separation between system management and usual programming interfaces. This entangles system management and normal applications, making in-process resource control awkward and useless.

2-2. Ownership of In-process Organization

In-process hierarchies can't be implemented without active participation from the target application for two reasons. First, a given thread's role is a piece of information private to the application. Second, as a new thread is put into the parent's cgroups, organization is inherently tied to how threads are created.

Note the contrast against system-level management. The only thing necessary for cgroup support at system-level is starting each application in the right cgroups. No cooperation is necessary.

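To make the non-atomic lookup mentioned in 2-1 concrete, here is a minimal sketch of what a thread has to do today to find its own cgroup. It relies on the documented "hierarchy-ID:controller-list:cgroup-path" line format of /proc/<pid>/cgroup and on /proc/thread-self. The helper names are made up for illustration, and where the hierarchy is mounted (commonly somewhere under /sys/fs/cgroup) is an assumption left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split one line of the form "hierarchy-ID:controller-list:cgroup-path"
 * into its controller list and path.  Returns 0 on success, -1 if the
 * line is malformed or does not fit into the supplied buffers.
 */
int parse_cgroup_line(const char *line, char *ctrl, size_t ctrl_sz,
                      char *path, size_t path_sz)
{
    const char *c1 = strchr(line, ':');
    const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
    size_t ctrl_len, path_len;

    if (!c2)
        return -1;

    ctrl_len = (size_t)(c2 - (c1 + 1));
    path_len = strcspn(c2 + 1, "\n");   /* drop the trailing newline */
    if (ctrl_len >= ctrl_sz || path_len >= path_sz)
        return -1;

    memcpy(ctrl, c1 + 1, ctrl_len);
    ctrl[ctrl_len] = '\0';
    memcpy(path, c2 + 1, path_len);
    path[path_len] = '\0';
    return 0;
}

/* Return 1 if @name is one of the comma-separated entries in @list. */
int has_controller(const char *list, const char *name)
{
    size_t n = strlen(name);

    while (*list) {
        size_t len = strcspn(list, ",");

        if (len == n && !strncmp(list, name, n))
            return 1;
        list += len;
        if (*list == ',')
            list++;
    }
    return 0;
}

/*
 * Step 1 of the lookup: read the calling thread's cgroup path for the
 * hierarchy that has @controller attached.  Returns 0 on success.
 *
 * Step 2, opening a file under <mount point>/<path>, is up to the
 * caller.  Nothing prevents the thread from being migrated between
 * the two steps, in which case step 2 operates on the wrong cgroup.
 */
int own_cgroup_path(const char *controller, char *path, size_t path_sz)
{
    char line[4096], ctrl[256];
    FILE *f = fopen("/proc/thread-self/cgroup", "r");
    int ret = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (parse_cgroup_line(line, ctrl, sizeof(ctrl), path, path_sz))
            continue;
        if (has_controller(ctrl, controller)) {
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

A caller would use own_cgroup_path("cpu", buf, sizeof(buf)) and then build a filesystem path from the result. The window between those two steps is the problem; no amount of care in userspace closes it.
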
Lacking clear ownership of in-process organization leads to other issues too. For example, an application can't be sure that the in-process organization it created remains unchanged. Threads may have been moved around. Some may not even be in the process's sub-hierarchy at all. On v1, such accidents can easily happen among processes sharing the same credentials. Also, the hierarchy itself could have changed. A cgroup may have been removed, renamed or replaced behind the process's back. This makes in-process organization fragile without contributing anything to the goal - in-process hierarchical resource distribution.

2-3. Management and Application Interfaces

In cgroup, the basic operations require strict coordination among its users and there are oddities such as name collisions between sub-cgroup names and interface files, a notification mechanism which involves forking, or the need for explicit cleanup. cgroup is much more of a system management interface than a general application interface.

This also shows in scalability. cgroup assumes that organization operations are infrequent and the synchronization scheme is geared toward minimizing hot path overheads. This is perfectly acceptable for a system management mechanism but a non-starter for a widely used application interface. For example, stemming from the architecture, migration is a fairly heavy operation. This doesn't matter for system management and is even desirable because it allows for aggressive optimization of the hot paths; however, hundreds of threads using it in parallel from userspace could bog down the entire machine.

While some have been using in-process hierarchies, it works only because the use cases are self-contained and limited. If the kernel wants to expose general hierarchical in-process resource distribution to normal applications, we must evaluate the requirements necessary to achieve the target functionality and make active trade-offs to build a robust interface with the right balance.
It also makes sense to take a conservative approach by default as we can always loosen up but not tighten down.

2-4. Cost of Membership Dynamism

There is an intrinsic trade-off between how dynamic something is and how expensive or difficult synchronizing around it is - dynamism doesn't come free. This applies well to cgroup as the cost and complexity of tracking a resource or a task's cgroup membership depends strongly on how dynamic that relationship is.

At the system level, cgroup membership is dynamic in a way which aggressively trades migration overhead for lower hot path overhead. This isn't an issue because when a sysadmin or system management software modifies cgroup membership too frequently, it's easy to tell them to not do that; however, if cgroup membership migration is exposed as a general programming interface, such an approach is no longer viable.

If supporting that level of dynamism is something which brings essential benefits, we can make that choice and pay in terms of added complexity and overhead in hot paths; however, this definitely isn't something we want to be committed to by simply being dragged into it for historical reasons.


3. Design Choices

There are several important abstract design choices which are independent of implementation details. As it is easy to miss them in a deluge of details, let's discuss the larger design points and then work our way to a specific implementation.

3-1. Exclusive Ownership of In-process Organization

As discussed, the target process must be an active participant in thread organization and depends on the organization not changing behind its back. Given those, it is logical to make in-process organization owned exclusively by each process. It gets rid of all ambiguities and the accompanying failure modes without losing core functionalities.

3-2. Static Grouping

Changing cgroup membership of a thread is all but guaranteed to be more expensive than scheduling an existing thread which is already in the target cgroup. This implies that there always is a better way to implement execution of a chunk of work in a remote cgroup than moving a thread into the cgroup.

In addition, establishing in-process hierarchical resource distribution is a significant step and it makes sense to start as restricted as possible while achieving the core functionalities.

It is logical to start with a model where in-process cgroup membership is determined on thread creation and remains immutable. This avoids exposing membership dynamism to normal applications, which will be expensive in terms of both complexity and hot path overhead. It also clearly signals that assignment of cgroup membership is an operation at least as expensive as thread creation and naturally excludes usages where cgroup membership is changed very frequently.

3-3. Extending the Thread Control Interfaces

cgroup has a pseudo filesystem interface at system level, which is great for interface flexibility; however, as an interface exposed to normal applications, it is unusual and awkward. Any operation is a multi-step process and it isn't difficult to create a sub-cgroup whose name collides with one of the interface files.

In-process hierarchical resource distribution shouldn't stand out like a sore thumb. If it can be implemented as a natural extension of the existing patterns and mechanisms, that is the right direction to take. As in-process structure follows clone(2) history, it has natural similarities to how processes are organized - e.g. the traditional process hierarchy or namespaces. On the resource control side, the existing rlimit facility has inherent similarities.

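As an illustration of the existing pattern referred to above, the following sketch shows that rlimit settings are already fixed at creation time by inheritance: a child created after the parent lowers its soft RLIMIT_NOFILE sees the same value. It uses fork(2) because rlimits are per-process today, and RLIMIT_NOFILE is only a stand-in for whatever resource would actually be controlled. The function name is made up for this example.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Lower the caller's soft RLIMIT_NOFILE (if it is above 64), fork, and
 * let the child check whether it observes the same soft limit.
 * Returns 1 if the child inherited the setting, 0 if it did not and
 * -1 on error.  The caller's original limit is restored before return.
 */
int rlimit_inherited_by_child(void)
{
    struct rlimit old, want, seen;
    pid_t pid;
    int status;

    if (getrlimit(RLIMIT_NOFILE, &old))
        return -1;

    /* Lowering the soft limit never requires privileges. */
    want = old;
    if (want.rlim_cur > 64)
        want.rlim_cur = 64;
    if (setrlimit(RLIMIT_NOFILE, &want))
        return -1;

    pid = fork();
    if (pid < 0) {
        setrlimit(RLIMIT_NOFILE, &old);
        return -1;
    }
    if (pid == 0) {
        /* Child: the setting was determined when we were created. */
        int same = getrlimit(RLIMIT_NOFILE, &seen) == 0 &&
                   seen.rlim_cur == want.rlim_cur;
        _exit(same ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0) {
        setrlimit(RLIMIT_NOFILE, &old);
        return -1;
    }

    /* Raising the soft limit back up to its old value is allowed. */
    setrlimit(RLIMIT_NOFILE, &old);

    if (!WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Nothing here is new functionality; the point is only that "settings follow creation history" is a pattern applications already know from rlimits.
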
One possible upside of exposing cgroupfs to normal applications is reuse of existing cgroup libraries; however, the part which can be reused is mostly encapsulation of multi-step filesystem operations into a more programmable interface. It doesn't make any sense to cling to the partial compatibility when the main benefit can be replaced by the kernel providing a more programmable interface.

There's no reason to deviate from existing programmable interfaces for in-process hierarchical resource distribution. It can be implemented as a natural extension of existing facilities and it should be.


4. Interface Proposal

4-1. In-process Organization

In-process hierarchy is separate from the system-level cgroup hierarchy. It is invisible from the cgroupfs interface and transparent for all operations - e.g. when a process is migrated to a different cgroup, the whole in-process hierarchy is atomically moved as-is.

By default, a new thread is put into the same in-process group as the parent. If explicitly indicated, e.g. with CLONE_NEWRESGROUP, a new group which is a child of the parent's group is created and the thread is put into it. The group is identified by the TID of the thread and stays around while there are sub-groups or threads in it.

For in-process use, TID based identification is enough; however, it can be useful to allow modifying resource settings from outside. To allow identifying each group from outside, a new prctl(2) operation can be introduced, e.g. PR_SET_RESGROUP_NAME, which can be called from any thread and sets the name of the group that the calling thread belongs to. The mapping between group IDs and names can be published in the process's /proc.

4-2. Resource Control Settings

Resource control settings can be implemented as a natural extension of the rlimit facility. get/setrlimit(2) and prlimit(2) provide all that's necessary to read and modify resource settings by the process itself and from outside.
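For reference, this is roughly what reading and modifying a setting through the existing facility looks like. It is only a sketch of the current prlimit(2) interface: RLIMIT_NOFILE stands in for a resource tag that doesn't exist yet, the helper names are made up, and pid 0 means the calling process. An external manager would pass the target's PID instead, subject to the usual permission checks.

```c
#define _GNU_SOURCE             /* prlimit() is a GNU/Linux extension */
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the current limits of @resource for process @pid (0 = caller). */
int read_limit(pid_t pid, int resource, struct rlimit *out)
{
    return prlimit(pid, resource, NULL, out);
}

/*
 * Change only the soft limit of @resource for process @pid, leaving
 * the hard limit alone.  If @old is non-NULL, the previous limits are
 * stored there.  Returns 0 on success, -1 on error.
 */
int set_soft_limit(pid_t pid, int resource, rlim_t soft, struct rlimit *old)
{
    struct rlimit cur, new_limit;

    if (prlimit(pid, resource, NULL, &cur))
        return -1;

    new_limit = cur;
    new_limit.rlim_cur = soft;
    return prlimit(pid, resource, &new_limit, old);
}
```

The same two calls cover both the in-process case and the "from outside" case, which is why no new system call would be needed for the settings themselves.
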
The only interface change needed is adding the matching RLIMIT_ resource tags.

Thanks.

-- 
tejun