Hello,

On Mon, Oct 11, 2021 at 04:17:37PM +0200, Michal Koutný wrote:
> The problem as I see it is the mapping from a real dedicated HW to a
> cgroup restricted environment ("container"), which can be shared. In
> this instance, the virtualized view would not be able to represent a
> situation when a CPU is assigned non-exclusively to multiple cpusets.

There is a fundamental problem with trying to represent a resource-shared
environment controlled with cgroup using system-wide interfaces, including
procfs, because the goals of much cgroup resource control include
work-conservation, which is also one of the main reasons why containers
are more attractive in resource-intense deployments. System-level
interfaces naturally describe a discrete system, which can't express the
dynamic distribution with cgroups.

There are aspects of cgroups which are akin to hard partitioning and thus
can be represented by diddling with system-level interfaces. Whether those
are worthwhile to pursue depends on how easy and useful they are; however,
there's no avoiding that each of those is gonna be a very partial and
fragmented thing, which significantly contributes to the default cons list
of such attempts.

> > Existing solutions to the problem include userspace tools like LXCFS
> > which can fake the sysfs information by mounting onto the sysfs online
> > file to be in coherence with the limits set through cgroup cpuset.
> > However, LXCFS is an external solution and needs to be explicitly set
> > up for applications that require it. Another concern is also that
> > tools like LXCFS don't handle all the other display mechanisms like
> > procfs load stats.
> >
> > Therefore, the need for a clean interface could be advocated for.
>
> I'd like to write something in support of your approach but I'm afraid
> that the problem of the mapping (dedicated vs shared) makes this most
> suitable for some external/separate entity such as the LXCFS already.
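To make the non-exclusive case concrete: two sibling cgroups can both be
granted overlapping CPUs through cpuset.cpus, and a single per-container
"online CPUs" number has no way to encode that overlap. A rough sketch
(the parser and the example CPU lists are illustrative, not from any
actual setup):

```python
def parse_cpulist(s):
    # Parse a cpuset-style CPU list like "0-3,5" (the format used by
    # cpuset.cpus) into a set of CPU ids.
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

# Sibling cgroups a and b, neither holding its CPUs exclusively:
a = parse_cpulist("0-3")
b = parse_cpulist("2-5")
shared = a & b  # CPUs 2 and 3 belong to both cpusets
```

Any virtualized system-wide view would have to pretend each container owns
its CPUs outright, which is exactly what isn't true here.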
This is more of a unit problem than an interface one - ie. the existing
numbers in the system interface don't really fit what needs to be
described.

One approach that we've found useful in practice is dynamically changing
resource consumption based on shortage, as measured by PSI, rather than
based on some number representing what's available. e.g. for a build
service, building a feedback loop which monitors its own cpu, memory and
io pressures and modulates the number of concurrent jobs.

There are some numbers which would be fundamentally useful - e.g. the
ballpark number of threads needed to saturate the computing capacity
available to the cgroup, or the ballpark bytes of memory available without
noticeable contention. Those, I think, we definitely need to work on, but
I don't see much point in trying to bend the existing /proc numbers for
them.

Thanks.

--
tejun
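P.S. The PSI feedback loop mentioned above can be sketched roughly like
this - the thresholds and step sizes are made-up illustrations, not tuned
values, but the file format matches what cgroup cpu.pressure (or
/proc/pressure/cpu) actually reports:

```python
import re

def parse_psi_some_avg10(text):
    # Extract avg10 from the "some" line of a PSI file, e.g.
    # "some avg10=34.21 avg60=12.00 avg300=3.00 total=123456".
    for line in text.splitlines():
        if line.startswith("some"):
            m = re.search(r"avg10=([0-9.]+)", line)
            if m:
                return float(m.group(1))
    raise ValueError("no 'some' line in PSI input")

def adjust_jobs(jobs, pressure, high=20.0, low=5.0, max_jobs=64):
    # One feedback step: back off when recent cpu pressure is high,
    # ramp up when it's low, hold steady in between.
    if pressure > high:
        return max(1, jobs - 1)
    if pressure < low:
        return min(max_jobs, jobs + 1)
    return jobs
```

A real build service would run this periodically against its own cgroup's
cpu.pressure (and similarly memory.pressure / io.pressure) and apply the
result to its worker pool.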
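P.P.S. For the "threads to saturate" number, one ballpark that can already
be derived today is quota/period from cgroup2 cpu.max - a sketch, assuming
bandwidth limiting is the only constraint in play (cpuset restrictions and
weight-based contention would complicate it):

```python
import math

def threads_to_saturate(cpu_max):
    # cgroup2 cpu.max contains "<quota> <period>" in microseconds,
    # or "max <period>" when unlimited. quota/period is how many
    # CPUs' worth of runtime the cgroup gets per period - a ballpark
    # for how many busy threads it can keep fed.
    quota, period = cpu_max.split()
    if quota == "max":
        return None  # unlimited; caller should fall back to CPU count
    return math.ceil(int(quota) / int(period))
```

e.g. a cgroup with cpu.max of "200000 100000" gets two CPUs' worth of
bandwidth, so roughly two busy threads saturate it.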