Re: RFC: documentation of the autogroup feature [v2]

On Fri, Nov 25, 2016 at 04:04:25PM +0100, Michael Kerrisk (man-pages) wrote:
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │How do the nice value of  a  process  and  the  nice │
> >>        │value of an autogroup interact? Which has priority?  │
> >>        │                                                     │
> >>        │It  *appears*  that the autogroup nice value is used │
> >>        │for CPU distribution between task groups,  and  that │
> >>        │the  process nice value has no effect there.  (I.e., │
> >>        │suppose two  autogroups  each  contain  a  CPU-bound │
> >>        │process,  with  one  process  having nice==0 and the │
> >>        │other having nice==19.  It appears  that  they  each │
> >>        │get  50%  of  the CPU.)  It appears that the process │
> >>        │nice value has effect only with respect to  schedul‐ │
> >>        │ing  relative to other processes in the *same* auto‐ │
> >>        │group.  Is this correct?                             │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Yup, entity nice level affects distribution among peer entities.
> 
> Huh! I only just learned about this via my experiments while
> investigating autogroups. 
> 
> How long have things been like this? Always? (I don't think
> so.) Since the arrival of CFS? Since the arrival of
> autogrouping? (I'm guessing not.) Since some other point?
> (When?)

Ever since cfs-cgroup: it is a fundamental design point of cgroups, and
has therefore always been the case for autogroups (which are nothing
more than an application of the cgroup code).

> It seems to me that this renders the traditional process
> nice pretty much useless. (I bet I'm not the only one who'd 
> be surprised by the current behavior.)

It's really rather fundamental to how the whole hierarchical thing
works.

CFS is a weighted fair queueing scheduler; this means each entity
receives:

               w_i
  dt_i = dt --------
            \Sum w_j


                CPU
          ______/ \______
         /    |     |    \
        A     B     C     D


So if each entity {A,B,C,D} has equal weight, then they will receive
equal time. Explicitly, for C you get:


                      w_C
  dt_C = dt -----------------------
            (w_A + w_B + w_C + w_D)
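
Or, as a quick standalone sketch (not kernel code; the weights are
made-up example values, 1024 being merely a convenient unit):

/* Toy sketch of flat WFQ: four peer entities with equal weight each
 * receive w_i / \Sum w_j of the time to distribute. */
#include <stdio.h>

int main(void)
{
	const char *name[] = { "A", "B", "C", "D" };
	double w[]         = { 1024, 1024, 1024, 1024 };
	double dt  = 100.0;	/* time to distribute, e.g. ms */
	double sum = 0.0;

	for (int i = 0; i < 4; i++)
		sum += w[i];

	for (int i = 0; i < 4; i++)	/* each prints 25.0 */
		printf("dt_%s = %.1f\n", name[i], dt * w[i] / sum);

	return 0;
}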


Extending this to a hierarchy, we get:


                CPU
          ______/ \______
         /    |     |    \
        A     B     C     D
                   / \
                  E   F

Here C becomes a 'server' for entities {E,F}. The weight of C does not
depend on its child entities. This way the time of {E,F} becomes a
straight product of their ratio within C and the time of C. That is,
the whole thing becomes (where l denotes the level in the hierarchy and
i an entity on that level):

                 l      w_g,i
  dt_l,i = dt \Prod  ----------
                g=0  \Sum w_g,j


Or more concretely, for E:

                      w_E
  dt_1,E = dt_0,C -----------
                  (w_E + w_F)

                        w_C               w_E
         = dt ----------------------- -----------
              (w_A + w_B + w_C + w_D) (w_E + w_F)
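
The same toy sketch, extended to the hierarchy (again with made-up
example weights); dt_E is simply the product of the per-level ratios:

/* C is a server entity for {E,F}; E's time is the product of C's
 * ratio at the top level and E's ratio within C. */
#include <stdio.h>

int main(void)
{
	double dt  = 100.0;	/* time to distribute, e.g. ms */
	double w_A = 1024, w_B = 1024, w_C = 1024, w_D = 1024;
	double w_E = 1024, w_F = 1024;

	double dt_C = dt * w_C / (w_A + w_B + w_C + w_D);	/* 25.0 */
	double dt_E = dt_C * w_E / (w_E + w_F);			/* 12.5 */

	printf("dt_C = %.1f, dt_E = %.1f\n", dt_C, dt_E);
	return 0;
}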


And this 'trivially' extends to SMP, with the tricky bit being that the
sums over all entities end up being machine-wide instead of per-CPU,
which is a real and royal pain for performance.


Note that this property, where the weight of the server entity is
independent of its child entities, is a desired feature. Without it,
it would be impossible to control the relative weights of groups, and
that is the sole parameter of the WFQ model.

It is also why Linus so likes autogroups: each session competes equally
with the others.
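
And that is the answer to the FIXME above: with one CPU-bound task per
autogroup, the task nice values never reach the top-level sum. A rough
sketch of that scenario (the weights 1024 for nice 0 and 15 for nice 19
correspond to the kernel's nice-to-weight table, though any mapping
would give the same 50/50 split):

/* Two autogroups, one CPU-bound task each, at nice 0 and nice 19.
 * Both autogroups sit at nice 0, so their weights are equal and each
 * group gets 50%; the task weights cancel out because each task is
 * alone in its group. */
#include <stdio.h>

int main(void)
{
	double w_g1 = 1024, w_g2 = 1024;	/* autogroup weights, both nice 0 */
	double share_g1 = w_g1 / (w_g1 + w_g2);	/* 0.5 */
	double share_g2 = w_g2 / (w_g1 + w_g2);	/* 0.5 */

	/* each group holds a single runnable task, so that task gets
	 * the whole group share, regardless of its own weight */
	printf("nice  0 task: %.0f%%\n", share_g1 * 100);	/* 50% */
	printf("nice 19 task: %.0f%%\n", share_g2 * 100);	/* 50% */

	return 0;
}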