On 5/22/23 15:49, Tejun Heo wrote:
Hello, Waiman.
Sorry for the late reply; I was off for almost 2 weeks on PTO.
On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote:
...
cpuset.cpus.reserve
A read-write multiple values file which exists only on root
cgroup.
It lists all the CPUs that are reserved for adjacent and remote
partitions created in the system. See the next section for
more information on what an adjacent or remote partition is.
Creation of an adjacent partition does not require touching this
control file as CPU reservation will be done automatically.
In order to create a remote partition, the CPUs needed by the
remote partition have to be written to this file first.
A "+" prefix can be used to indicate a list of additional
CPUs that are to be added without disturbing the CPUs that are
originally there. For example, if its current value is "3-4",
echoing "+5" to it will change it to "3-5".
Once a remote partition is destroyed, its CPUs have to be
removed from this file or no other process can use them. A "-"
prefix can be used to remove a list of CPUs from it. However,
removing CPUs that are currently used in existing partitions
may cause those partitions to become invalid. A single "-"
character without any number can be used to indicate removal
of all the free CPUs not allocated to any partitions to avoid
accidental partition invalidation.
Why is the syntax different from .cpus? Wouldn't it be better to keep them
the same?
Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs
that are used in multiple partitions. Also, automatic reservation for
adjacent partitions can happen in parallel. That is why I think it will
be safer to allow incremental addition or removal of the reserved CPUs
to be used for remote partitions. I will include this reasoning in the
doc file.
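To make the incremental semantics concrete, here is a minimal Python sketch that models how a write to the proposed cpuset.cpus.reserve file would change its value. It is only an illustration of the quoted documentation, not the kernel implementation. The function names and the in_use parameter are hypothetical, and treating an unprefixed write as a full replacement is an assumption taken from the "without disturbing the CPUs that are originally there" wording.

```python
def parse_cpulist(text):
    """Parse a CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpulist(cpus):
    """Format a set of ints back into a compact CPU list string."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def apply_reserve_write(current, written, in_use=""):
    """Model the new value of cpuset.cpus.reserve after a write.

    current: value of the file before the write
    written: string written to the file
    in_use:  CPUs currently allocated to partitions (hypothetical input)
    """
    reserved = parse_cpulist(current)
    used = parse_cpulist(in_use)
    if written == "-":
        # Bare "-": drop only the free CPUs, keep those used by partitions.
        reserved &= used
    elif written.startswith("+"):
        # "+" prefix: add CPUs without disturbing the existing ones.
        reserved |= parse_cpulist(written[1:])
    elif written.startswith("-"):
        # "-" prefix: remove CPUs; may invalidate partitions that use them.
        reserved -= parse_cpulist(written[1:])
    else:
        # Assumption: an unprefixed write replaces the whole list.
        reserved = parse_cpulist(written)
    return format_cpulist(reserved)
```

With the example from the documentation, apply_reserve_write("3-4", "+5") gives "3-5". The bare "-" case needs to know which CPUs are in use, which is why the model takes that as an extra argument.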
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
and is not delegatable.
It accepts only the following input values when written to.
========== =====================================
"member"   Non-root member of a partition
"root"     Partition root
"isolated" Partition root without load balancing
========== =====================================
A cpuset partition is a collection of cgroups with a partition
root at the top of the hierarchy and its descendants except
those that are separate partition roots themselves and their
descendants. A partition has exclusive access to the set of
CPUs allocated to it. Other cgroups outside of that partition
cannot use any CPUs in that set.
There are two types of partitions - adjacent and remote. The
parent of an adjacent partition must be a valid partition root.
Partition roots of adjacent partitions are all clustered around
the root cgroup. Creation of an adjacent partition is done by
writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent.
So a remote partition can be formed far from the root cgroup.
However, its creation is a 2-step process. The CPUs needed
by a remote partition ("cpuset.cpus" of the partition root)
have to be written into "cpuset.cpus.reserve" of the root
cgroup first. After that, "isolated" can be written into
"cpuset.cpus.partition" of the partition root to form a remote
isolated partition which is the only supported remote partition
type for now.
All remote partitions are terminal, as adjacent partitions cannot
be created underneath them.
Can you elaborate on this extra restriction a bit further?
Are you referring to the fact that only remote isolated partitions are
supported? I do not preclude the support of load balancing remote
partitions. I keep it to isolated partitions for now for ease of
implementation and I am not currently aware of a use case where such a
remote partition type is needed.
If you are talking about remote partitions being terminal, it is mainly
because it can be trickier to support hierarchical adjacent
partitions underneath them, especially if they are not isolated. We can
certainly support it if a use case arises. I just don't want to
implement code that nobody is really going to use.
BTW, with the current way the remote partition is created, it is not
possible to have another remote partition underneath it.
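For reference, the 2-step creation described in the quoted documentation can be sketched in Python as below. This is a rough illustration against the interface proposed in this patch, not a tested tool. The function name and its arguments are hypothetical, and it assumes "cpuset.cpus" of the target cgroup has already been set up.

```python
from pathlib import Path


def create_remote_partition(cgroup_root, partition_dir):
    """Form a remote isolated partition following the proposed 2-step flow.

    cgroup_root:   mount point of the cgroup v2 hierarchy (hypothetical arg)
    partition_dir: path of the new partition root, relative to cgroup_root
    """
    root = Path(cgroup_root)
    part = root / partition_dir

    # The CPUs the remote partition needs are its own cpuset.cpus.
    cpus = (part / "cpuset.cpus").read_text().strip()

    # Step 1: add those CPUs to cpuset.cpus.reserve of the root cgroup,
    # using the "+" prefix so existing reservations are left alone.
    (root / "cpuset.cpus.reserve").write_text("+" + cpus)

    # Step 2: write "isolated" to cpuset.cpus.partition of the new
    # partition root; it is the only supported remote type for now.
    (part / "cpuset.cpus.partition").write_text("isolated")
    return cpus
```

Since the reservation goes through the root cgroup only, nothing in this flow depends on the parent of the new partition root being a partition root itself.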
In general, I think it'd be really helpful if the document explained the
reasoning behind the design decisions, i.e., what is reserving for? What
purpose does it serve that the regular isolated ones cannot? That'd help
clarify the design decisions.
I understand your concern. If you think it is better to support both
types of remote partitions, or hierarchical adjacent partitions
underneath them, for the sake of symmetry, I can certainly do that. It
will just take a bit more time.
Cheers,
Longman