Hello,

On Mon, Oct 11, 2021 at 04:17:37PM +0200, Michal Koutný wrote:
> The problem as I see it is the mapping from a real dedicated HW to a
> cgroup restricted environment ("container"), which can be shared. In
> this instance, the virtualized view would not be able to represent a
> situation when a CPU is assigned non-exclusively to multiple cpusets.

There is a fundamental problem with trying to represent a resource-shared
environment controlled with cgroup using system-wide interfaces, including
procfs, because the goals of much cgroup resource control include
work-conservation, which is also one of the main reasons why containers
are more attractive in resource-intense deployments. System-level
interfaces naturally describe a discrete system, which can't express the
dynamic distribution with cgroups.

There are aspects of cgroups which are akin to hard partitioning and thus
can be represented by diddling with system-level interfaces. Whether those
are worthwhile to pursue depends on how easy and useful they are; however,
there's no avoiding that each of those is gonna be a very partial and
fragmented thing, which significantly contributes to the default cons list
of such attempts.

> > Existing solutions to the problem include userspace tools like LXCFS
> > which can fake the sysfs information by mounting onto the sysfs online
> > file to be in coherence with the limits set through cgroup cpuset.
> > However, LXCFS is an external solution and needs to be explicitly set
> > up for applications that require it. Another concern is also that
> > tools like LXCFS don't handle all the other display mechanisms like
> > procfs load stats.
> >
> > Therefore, the need for a clean interface could be advocated for.
>
> I'd like to write something in support of your approach but I'm afraid
> that the problem of the mapping (dedicated vs shared) makes this most
> suitable for some external/separate entity such as the LXCFS already.
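To make the non-exclusive case concrete: two sibling cgroups can both be
granted overlapping CPUs through cpuset.cpus, and a single per-container
"online CPUs" number has no way to encode that overlap. A rough sketch
(the parser and the example CPU lists are illustrative, not from any
actual setup):

```python
def parse_cpulist(s):
    # Parse a cpuset-style CPU list like "0-3,5" (the format used by
    # cpuset.cpus) into a set of CPU ids.
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

# Sibling cgroups a and b, neither holding its CPUs exclusively:
a = parse_cpulist("0-3")
b = parse_cpulist("2-5")
shared = a & b  # CPUs 2 and 3 belong to both cpusets
```

Any virtualized system-wide view would have to pretend each container owns
its CPUs outright, which is exactly what isn't true here.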
This is more of a unit problem than an interface one - ie. the existing
numbers in the system interface don't really fit what needs to be
described.

One approach that we've found useful in practice is dynamically changing
resource consumption based on shortage, as measured by PSI, rather than
based on some number representing what's available. e.g. for a build
service, building a feedback loop which monitors its own cpu, memory and
io pressures and modulates the number of concurrent jobs.

There are some numbers which would be fundamentally useful - e.g. the
ballpark number of threads needed to saturate the computing capacity
available to the cgroup, or the ballpark bytes of memory available without
noticeable contention. Those, I think, we definitely need to work on, but
I don't see much point in trying to bend the existing /proc numbers for
them.

Thanks.

--
tejun
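P.S. The PSI feedback loop mentioned above can be sketched roughly like
this - the thresholds and step sizes are made-up illustrations, not tuned
values, but the file format matches what cgroup cpu.pressure (or
/proc/pressure/cpu) actually reports:

```python
import re

def parse_psi_some_avg10(text):
    # Extract avg10 from the "some" line of a PSI file, e.g.
    # "some avg10=34.21 avg60=12.00 avg300=3.00 total=123456".
    for line in text.splitlines():
        if line.startswith("some"):
            m = re.search(r"avg10=([0-9.]+)", line)
            if m:
                return float(m.group(1))
    raise ValueError("no 'some' line in PSI input")

def adjust_jobs(jobs, pressure, high=20.0, low=5.0, max_jobs=64):
    # One feedback step: back off when recent cpu pressure is high,
    # ramp up when it's low, hold steady in between.
    if pressure > high:
        return max(1, jobs - 1)
    if pressure < low:
        return min(max_jobs, jobs + 1)
    return jobs
```

A real build service would run this periodically against its own cgroup's
cpu.pressure (and similarly memory.pressure / io.pressure) and apply the
result to its worker pool.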
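P.P.S. For the "threads to saturate" number, one ballpark that can already
be derived today is quota/period from cgroup2 cpu.max - a sketch, assuming
bandwidth limiting is the only constraint in play (cpuset restrictions and
weight-based contention would complicate it):

```python
import math

def threads_to_saturate(cpu_max):
    # cgroup2 cpu.max contains "<quota> <period>" in microseconds,
    # or "max <period>" when unlimited. quota/period is how many
    # CPUs' worth of runtime the cgroup gets per period - a ballpark
    # for how many busy threads it can keep fed.
    quota, period = cpu_max.split()
    if quota == "max":
        return None  # unlimited; caller should fall back to CPU count
    return math.ceil(int(quota) / int(period))
```

e.g. a cgroup with cpu.max of "200000 100000" gets two CPUs' worth of
bandwidth, so roughly two busy threads saturate it.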