Re: [PATCH v12 8/8] cgroup: implement the PIDs subsystem

Tejun Heo <tj@xxxxxxxxxx> · Thu, 28 May 2015 16:33:12 -0400

Hello, sorry about the delay.

On Tue, May 19, 2015 at 12:56:31PM +0200, Peter Zijlstra wrote:
> > This has been discussed before. Organisational operations (i.e.
> > attaching to a cgroup) are not to be blocked by a cgroup controller in
> > the unified hierarchy. 
> 
> That's utterly insane. As argued at length in threads like:
> 
>   lkml.kernel.org/r/alpine.DEB.2.11.1505061100040.4225@nanos
> 
> This breaks fundamental control rules and makes life for a number of
> controllers impossible.

I didn't chase that discussion because it was rather off-topic for
the scheduler.

There are several classes of distribution schemes that cgroups deal
with.

A. Ratio-based.  Usually used to distribute resources which are
   replenished over time.  IO time, CPU cycles and so on.  This
   primarily doesn't deal with persistent state.

B. Limiting over-committable resources.  This applies to persistent
   resources like memory but also to transient ones like IO bandwidth
   and iops.  These all operate by limiting how much of a resource is
   newly given out and thus their neutral state is the overcommitted
   no-limit state.

C. Non-over-committable "hard" resources.  Currently, scheduler RT
   slices are the only one.  These actually should be distributed by
   carving out a finite whole and thus their limits can't be
   over-committed.  They have to behave as allocators rather than
   limiters.
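
As a rough illustration of how A and C differ, here is a minimal
sketch.  The names ratio_shares and Allocator are made up for this
example and are not kernel interfaces; the point is only that a
ratio-based scheme always accepts a configuration, while an allocator
has to reject one that would over-commit the finite whole.

```python
def ratio_shares(weights, capacity):
    """Class A: split a replenishing resource in proportion to weights."""
    total = sum(weights.values())
    return {name: capacity * w / total for name, w in weights.items()}


class Allocator:
    """Class C: carve reservations out of a finite whole (hypothetical)."""

    def __init__(self, total):
        self.total = total
        self.reserved = {}

    def reserve(self, name, amount):
        # Sum what everyone else holds; the new amount replaces any
        # earlier reservation made under the same name.
        others = sum(v for k, v in self.reserved.items() if k != name)
        if others + amount > self.total:
            return False  # the configuration itself is rejected
        self.reserved[name] = amount
        return True


shares = ratio_shares({"a": 1, "b": 3}, 100)

rt = Allocator(100)
first = rt.reserve("rt1", 60)    # fits
second = rt.reserve("rt2", 50)   # would over-commit, refused
third = rt.reserve("rt2", 40)    # exactly fills the whole
```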

Most persistent resources fall in the B category and we have very
clear precedents for dealing with configurations of these limits.
Just think about the NPROC ulimit or quota.  They all operate by
suppressing distribution of new resources and allow new limit
configuration to be lower than the current consumption.

There's a clear reason for this.  It allows closing the race window
between configuration change and increasing resource consumption in a
very simple way - lowering the limit and checking the existing usage.
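
A minimal sketch of that limiter model, assuming a single counter
protected by a lock.  The Limiter class and its method names are
hypothetical and only model the behaviour described here; this is not
how any kernel controller is written.

```python
import threading


class Limiter:
    """Class B: only new charges are checked against the limit."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0
        self._lock = threading.Lock()

    def try_charge(self, n=1):
        # New consumption is refused once it would exceed the limit.
        with self._lock:
            if self.usage + n > self.limit:
                return False
            self.usage += n
            return True

    def uncharge(self, n=1):
        with self._lock:
            self.usage -= n

    def set_limit(self, new_limit):
        # Never fails, even when new_limit is below current usage.
        # Returning the usage seen under the same lock is the
        # "lower the limit, then check the existing usage" step.
        with self._lock:
            self.limit = new_limit
            return self.usage


pids = Limiter(limit=10)
for _ in range(8):
    pids.try_charge()

observed = pids.set_limit(5)       # accepted although usage is 8
blocked = not pids.try_charge()    # growth is suppressed from now on
```

Once the limit is lowered, usage can only go down, so whoever changed
the configuration can deal with the excess it observed without racing
against new consumption.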

What Thomas suggested - building a whole new transaction model on
top - can also close the race window, but it breaks from the
convention for no good reason.  It doesn't provide anything beyond
what's possible with the established model and it's outright silly
to have an NPROC controller behave so differently from the existing
mechanism which controls exactly the same resource.

> Also, I'll NAK each and every patch that will attempt to remove failing
> can_attach from the cgroup core as it will fundamentally break some
> scheduler controllers.

I was struggling with C above because only a single resource type
belongs to that category, but given that cgroups have to support it,
->can_attach() will have to be able to fail for that resource type,
and only for that resource type.
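
To make the distinction concrete, here is a toy model.  Group,
can_attach and attach are hypothetical Python names used for
illustration only; they mirror the idea, not the signature or
semantics of the kernel callback.

```python
class Group:
    def __init__(self, kind, capacity):
        self.kind = kind          # "limiter" (B) or "allocator" (C)
        self.capacity = capacity  # limit for B, finite whole for C
        self.usage = 0


def can_attach(group, cost):
    # Only allocator-type resources may veto a migration.
    if group.kind == "allocator":
        return group.usage + cost <= group.capacity
    # Limiters never block organisational operations; the incoming
    # usage is charged even if it pushes the group over its limit.
    return True


def attach(group, cost):
    if not can_attach(group, cost):
        return False
    group.usage += cost
    return True


pids_like = Group("limiter", capacity=2)
rt_like = Group("allocator", capacity=2)

moved_b = attach(pids_like, 3)   # allowed, usage ends up above limit
moved_c = attach(rt_like, 3)     # refused, nothing is charged
```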

> So please use it, it doesn't make any bloody sense to 'control' the
> number of PIDs but then allow it to overrun the set point.

Again, it's not about whether ->can_attach() can fail or not in terms
of implementation at all.  It's about following a consistent resource
distribution model.  Please don't conflate different resource types.

Thanks.

-- 
tejun