PSI accounts stalls for each cgroup separately and aggregates them at
each level of the hierarchy. This may cause non-negligible overhead for
some workloads that run deep in the hierarchy.

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made PSI skip per-cgroup stall accounting and only
account system-wide, to avoid this per-level overhead.

For our use case, we also want PSI accounted for the leaf cgroup, so
that userspace can make adjustments on that cgroup, in addition to
system-wide management.

So this patch adds the kernel cmdline parameter "psi_inner_cgroup" to
control whether or not to account for inner cgroups. It defaults to
true for compatibility.

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. default (psi_inner_cgroup=true)

 $ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

      Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

      Total time: 7.758 [sec]

        7.758354 usecs/op
          128893 ops/sec

2. psi_inner_cgroup=false

 $ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

      Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

      Total time: 7.309 [sec]

        7.309436 usecs/op
          136809 ops/sec

Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 kernel/sched/psi.c                              | 11 ++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..6beef5b8bc36 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4419,6 +4419,12 @@
 			tracking.
 			Format: <bool>
 
+	psi_inner_cgroup=
+			[KNL] Enable or disable pressure stall information
+			tracking for the inner cgroups.
+			Format: <bool>
+			default: enabled
+
 	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
 			probe for; one of (bare|imps|exps|lifebook|any).
 	psmouse.rate=	[HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2228cbf3bdd3..8d76920f47b3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -147,12 +147,21 @@ static bool psi_enable;
 #else
 static bool psi_enable = true;
 #endif
+
+static bool psi_inner_cgroup __read_mostly = true;
+
 static int __init setup_psi(char *str)
 {
 	return kstrtobool(str, &psi_enable) == 0;
 }
 __setup("psi=", setup_psi);
 
+static int __init setup_psi_inner_cgroup(char *str)
+{
+	return kstrtobool(str, &psi_inner_cgroup) == 0;
+}
+__setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
+
 /* Running averages - we need to be higher-res than loadavg */
 #define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
 #define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
@@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	group_init(&cgroup->psi);
 
 	parent = cgroup_parent(cgroup);
-	if (parent && cgroup_parent(parent))
+	if (parent && cgroup_parent(parent) && psi_inner_cgroup)
 		cgroup->psi.parent = cgroup_psi(parent);
 	else
 		cgroup->psi.parent = &psi_system;
-- 
2.36.1
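
Illustration (not part of the patch): the effect of the psi_cgroup_alloc()
change on the accounting walk can be sketched with a small userspace model.
Everything below is hypothetical and exists only for this example: the
model_cgroup type, model_cgroup_alloc(), model_groups_charged() and
MODEL_MAX_DEPTH are not kernel names. The sketch assumes that a state change
is charged to the task's own cgroup, then to each linked parent, and finally
to the system-wide group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 16	/* arbitrary limit for this sketch */

/* Hypothetical stand-in for a cgroup together with its PSI parent link. */
struct model_cgroup {
	struct model_cgroup *parent;	/* hierarchy parent, NULL for the root */
	struct model_cgroup *psi_next;	/* next cgroup to charge; NULL means
					 * "continue with the system-wide group" */
};

/* Same condition as the patched psi_cgroup_alloc(), on the model types. */
static void model_cgroup_alloc(struct model_cgroup *cg,
			       struct model_cgroup *parent, bool inner)
{
	assert(cg != NULL);
	cg->parent = parent;
	if (parent && parent->parent && inner)
		cg->psi_next = parent;	/* charge the inner cgroup as well */
	else
		cg->psi_next = NULL;	/* link straight to the system group */
}

/*
 * Build a chain of `depth` nested cgroups below the root and count how many
 * groups one state change in the deepest cgroup touches, including the
 * system-wide group. Returns 0 if depth is outside 1..MODEL_MAX_DEPTH.
 */
size_t model_groups_charged(size_t depth, bool inner)
{
	struct model_cgroup root, chain[MODEL_MAX_DEPTH];
	struct model_cgroup *prev = &root, *cg;
	size_t i, charged = 1;	/* the system-wide group is always charged */

	if (depth == 0 || depth > MODEL_MAX_DEPTH)
		return 0;

	model_cgroup_alloc(&root, NULL, inner);
	for (i = 0; i < depth; i++) {
		model_cgroup_alloc(&chain[i], prev, inner);
		prev = &chain[i];
	}

	/* Walk from the leaf along the PSI parent links. */
	for (cg = prev; cg; cg = cg->psi_next)
		charged++;

	return charged;
}
```

Under these assumptions, model_groups_charged(3, true) returns 4 (three
cgroup levels plus the system-wide group), while
model_groups_charged(3, false) returns 2 (the leaf cgroup plus the
system-wide group). That matches the intent stated in the commit message:
leaf cgroups stay accounted, and the per-level cost in between goes away.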