Re: [Documentation] State of CPU controller in cgroup v2

Tejun Heo <tj@xxxxxxxxxx> · Sat, 3 Sep 2016 18:05:26 -0400

Hello, Andy.

On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote:
> > Consider a use case where the user isn't interested in fully
> > accounting and dividing up system resources but wants to just cap
> > resource usage from a subset of workloads.  There is no reason to
> > require such usages to fully contain all processes in non-root
> > cgroups.  Furthermore, it's not trivial to migrate all processes out
> > of root to a sub-cgroup unless the agent is in full control of boot
> > process.
> 
> Then please also consider exactly the same use case while running in a
> container.
> 
> I'm a bit frustrated that you're saying that my example failure modes
> consist of shooting oneself in the foot and then you go on to come up
> with your own examples that have precisely the same problem.

You have a point, which is

  The system-root and namespace-roots are not symmetric.

and that's a valid concern.  Here's why the system-root is special.

* A system has entities and resource consumptions which can only be
  attributed to the "system".  The system-root is the natural place to
  put them.  The system-root has stuff no other cgroups, not even
  namespace-roots, have.  It's a unique situation.

* The need to bypass most cgroup-related overhead when not in use.
  The system-root is there whether cgroup is actually in use or not and
  thus cannot impose noticeable overhead.  It has to make sense for
  both resource-controlled systems and ones that aren't.  Again, no
  other group has these requirements.

  Note that this means that all controllers have to be able to allow
  uncontained consumptions in the system-root, and they already do.
  I'll come back to this later.

Now, due to the various issues with direct competition between
processes and cgroups, cgroup v2 disallows resource control across
them (the no-internal-tasks restriction); however, cgroup v2 currently
doesn't apply the restriction to the system-root.  Here are the
reasons.

* It doesn't bring any practical benefits in terms of implementation.
  As noted above, all controllers already have to allow uncontained
  consumptions in the system-root and that's the only attribute
  required for the exemption.

* It doesn't bring any practical benefits in terms of capability.
  Userland can trivially handle the system-root and namespace-roots in
  a symmetrical manner.

* It's an unnecessary inconvenience, especially for cases where the
  cgroup agent isn't in control of boot, for partial usage cases, or
  just for playing with it.
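
To make the rule and the exemption concrete, here's a minimal sketch
in Python.  It's a toy model, not kernel code: the Cgroup class, its
method names and the InternalTasksError exception are all made up for
illustration.  It only models the two checks discussed above, that a
non-root cgroup can't both hold processes and have controllers enabled
for its children, and that the system-root is exempt.

```python
class InternalTasksError(Exception):
    """Raised when the (modeled) no-internal-tasks rule is violated."""


class Cgroup:
    """Toy stand-in for a cgroup v2 node.  Hypothetical, for illustration."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # None means this is the system-root
        self.procs = set()            # PIDs attached directly to this cgroup
        self.subtree_control = set()  # controllers enabled for children

    @property
    def is_system_root(self):
        return self.parent is None

    def enable(self, controller):
        """Enable a controller for this cgroup's children."""
        if not self.is_system_root and self.procs:
            raise InternalTasksError(
                f"{self.name}: has processes, can't enable {controller}")
        self.subtree_control.add(controller)

    def attach(self, pid):
        """Put a process directly into this cgroup."""
        if not self.is_system_root and self.subtree_control:
            raise InternalTasksError(
                f"{self.name}: controllers enabled, can't attach {pid}")
        self.procs.add(pid)


root = Cgroup("/")
root.attach(1)        # processes left in the system-root ...
root.enable("cpu")    # ... don't prevent enabling controllers there

container = Cgroup("container", parent=root)  # e.g. a namespace-root
container.attach(100)
try:
    container.enable("cpu")   # non-root: processes and children compete
except InternalTasksError as err:
    print("rejected:", err)
```

In this model the only difference between the two roots is the
is_system_root check, which is the whole extent of the exemption.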

You say that I'm ignoring the same use case for namespace-scope but
namespace-roots don't have the same hybrid function for partial and
uncontrolled systems, so it's not clear why there even NEEDS to be
strict symmetry.

On this subject, your only actual point is that there is an asymmetry
and that's bothersome.  I've been trying to explain why the special
case doesn't actually get in the way in terms of implementation or
capability and is actually beneficial.  Instead of engaging in the
actual discussion, you're constantly coming up with different ways of
saying "it's not symmetric".

The system-root and namespace-roots aren't equivalent.  There are a
lot of parallels between the system-root and a namespace-root but they
aren't the same thing (e.g. bootstrapping a namespace is a less
complicated and more malleable process).  The system-root is not even
a fully qualified node of the resource graph.

It's easy and understandable to get hung up on asymmetries or
exemptions like this, but they are also often acceptable trade-offs.
It's really frustrating to see you first getting hung up on "this must
be wrong" and then, even after explanations, repeating the same thing
just in different ways.

If there is something fundamentally wrong with it, sure, let's fix it,
but what's actually broken?

> > I have, multiple times.  Can you please read 2-1-2 of the document in
> > the original post and take the discussion from there?
> 
> I've read it multiple times, and I don't see any explanation that's
> consistent with the fact that you are exempting the root cgroup from
> this constraint.  If the constraint were really critical to everything
> working, then I would expect the root cgroup to have exactly the same
> problem.  This makes me think that either something nasty is being
> fudged for the root cgroup or that the constraint isn't actually so
> important after all.  The only thing on point I can find is:
> 
> > Root cgroup is exempt from this constraint, which is in line with
> > how root cgroup is handled in general - it's excluded from cgroup
> > resource accounting and control.
> 
> and that's not very helpful.

My apologies.  I somehow thought that was part of the documentation.
Will update it later, but here's an excerpt from my earlier response.

  Having a special case doesn't necessarily get in the way of
  benefiting from a set of general rules.  The root cgroup is
  inherently special as it has to be the catch-all scope for entities
  and resource consumptions which can't be tied to any specific
  consumer - irq handling, packet rx, journal writes, memory reclaim
  from global memory pressure and so on.  None of sub-cgroups have to
  worry about them.

  These base-system operations are special regardless of cgroup and we
  already have sometimes crude ways to affect their behaviors where
  necessary through sysctl knobs, priorities on specific kernel
  threads and so on.  cgroup doesn't change the situation all that
  much.  What gets left in the root cgroup usually are the base-system
  operations which are outside the scope of cgroup resource control in
  the first place and cgroup resource graph can treat the root as an
  opaque anchor point.

  There can be other ways to deal with the issue; however, treating
  root cgroup this way has the big advantage of minimizing the gap
  between configurations without and with cgroups both in terms of
  mental model and implementation.

  Hopefully, the case of a namespace root is clear now.  If it's gonna
  have a sub-hierarchy, it itself can't contain processes but the
  system root just contains base-system entities and resources which a
  namespace root doesn't have to worry about.  Ignoring base-system
  stuff, a namespace root is topologically in the same position as the
  system root in the cgroup resource graph.

Maybe this wasn't as clear as I thought it was.  I hope the earlier
part of this message is enough of a clarification.

> >> Also, here's an idea to maybe make PeterZ happier: relax the
> >> restriction a bit per-controller.  Currently (except for /), if you
> >> have subtree control enabled you can't have any processes in the
> >> cgroup.  Could you change this so it only applies to certain
> >> controllers?  If the cpu controller is entirely happy to have
> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
> >> subtree control enabled could allow processes to exist.
> >
> > The document lists several reasons for not doing this and also that
> > there is no known real world use case for such configuration.

So, up until this point, we were talking about the no-internal-tasks
constraint.

> My company's production workload would map quite nicely to this
> relaxed model.  I have quite a few processes each with several
> threads.  Some of those threads get some CPUs, some get other CPUs,
> and they vary in what shares of what CPUs they get.  To be clear,
> there is not a hierarchy of resource usage that's compatible with the
> process hierarchy.  Multiple processes have threads that should be
> grouped in a different place in the hierarchy than other threads.
> Concretely, I have processes A and B with threads A1, A2, B1, and B2.
> (And many more, but this is enough to get the point across.)  The
> natural grouping is:
> 
> Group 1: A1 and B1
> Group 2: A2
> Group 3: B2

And now you're talking about the process-granularity constraint.
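
To spell out why this is a different constraint, here's a minimal
sketch in Python.  It assumes that process granularity means all
threads of a process sit in the same group; the function name and the
two dictionaries are hypothetical, and the thread and group names are
taken from the example quoted above.

```python
from collections import defaultdict


def split_processes(thread_groups, thread_process):
    """Return the processes whose threads land in more than one group.

    Under process granularity (as assumed here), any process in the
    returned list makes the grouping inexpressible.
    """
    groups_per_process = defaultdict(set)
    for thread, group in thread_groups.items():
        groups_per_process[thread_process[thread]].add(group)
    return sorted(p for p, groups in groups_per_process.items()
                  if len(groups) > 1)


# Which process each thread belongs to.
thread_process = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

# The grouping from the quoted example.
thread_groups = {
    "A1": "Group 1",
    "B1": "Group 1",
    "A2": "Group 2",
    "B2": "Group 3",
}

print(split_processes(thread_groups, thread_process))  # ['A', 'B']
```

Both processes are split across groups, which has nothing to do with
whether a cgroup holds processes and child cgroups at the same time.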

> This cannot be expressed with rgroup or with cgroup2.  cgroup1 has no
> problem with it.  If I were using memcg, I would want to have a memcg
> hierarchy that was incompatible with the hierarchy above, so I
> actually find the cgroup2 insistence on a unified hierarchy to be a
> bit annoying, but I at least understand the motivation behind the
> unified hierarchy.
> 
> And I don't care that the system controller can't atomically move this
> whole mess around.  I'm currently running without systemd, so I don't

I do.  It's a horrible userland API to expose to individual
applications if the organization that a given application expects can
be disturbed by system operations.  Imagine how this would be
documented - "if this operation races with a system operation, it may
return -ENOENT.  Repeating the path lookup might make the operation
succeed again."

> *have* a system controller.  If I end up migrating to systemd, I'll
> probably put this whole pile into its own slice and manage it
> manually.

Yeah, systemd has a delegation feature for cases like that, which we
depend on too.

As for your example, who performs the cgroup setup and configuration,
the application itself or an external entity?  If an external entity,
how does it know which thread is what?

And, as for rgroup not covering it, would extending rgroup to cover
multi-process cases be enough or are there more fundamental issues?

> > Yeap, the name collisions suck.  I thought about disallowing all
> > sub-cgroups which starts with "KNOWN_SUBSYS." but that has a
> > non-trivial chance of breaking users which were happy before when a
> > new controller gets added.  But, yeah, we at least should disallow the
> > known filenames.  Will think more about it.
> 
> How about disallowing names that contain a '.'?

That's guaranteed to break things left and right, and, given how far
it departs from how things have always been, including in v1, it'd be
a gratuitously painful change.  While name collisions are a nasty
possibility, they are seldom a practical problem, as most users pick
naming schemes which are unlikely to actually collide.  Even
"$SUBSYS." is likely too broad.  Most cures seem worse than the
disease here.
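
As a rough illustration of how the three candidate rules differ in
reach, here's a minimal sketch in Python.  Nothing here is an existing
kernel check: the function names are made up, and the prefix and
filename lists are small illustrative subsets, not complete lists.
"system.slice" and "memory.hogs" are just example directory names a
user might already be using.

```python
# Illustrative subsets only; real systems have more of both.
SUBSYS_PREFIXES = ("cgroup.", "cpu.", "memory.", "io.", "pids.")
KNOWN_FILES = {
    "cgroup.procs",
    "cgroup.controllers",
    "cgroup.subtree_control",
    "memory.max",
    "pids.max",
}


def rejected_by_no_dot(name):
    """Rule 1: disallow any name containing a '.'."""
    return "." in name


def rejected_by_subsys_prefix(name):
    """Rule 2: disallow names starting with "$SUBSYS."."""
    return name.startswith(SUBSYS_PREFIXES)


def rejected_by_known_file(name):
    """Rule 3: disallow only names of known interface files."""
    return name in KNOWN_FILES


for name in ("system.slice", "memory.hogs", "memory.max", "workers"):
    print(name,
          rejected_by_no_dot(name),
          rejected_by_subsys_prefix(name),
          rejected_by_known_file(name))
```

In this sketch the '.' rule rejects every dotted name, the prefix rule
still rejects a harmless "memory.hogs", and only the known-filename
rule is limited to names that would actually collide.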

Thanks.

-- 
tejun