In record_times() we use 'now' and groupc->state_start to calculate the delta as bellow, delta = now - groupc->state_start; But note that groupc->state_start may be not initialized yet, IOW, the state_start may be 0 currently. If state_start is 0, this calculation is same with assigning the lower 32-bit of 'now' to delta, that is a random value. To fix this value, we should initialize groupc->state_start before. After we calculate the delta, we will assign 'now' to groupc->state_start then, groupc->state_start = now; This will cause the same issue if groupc->state_start will not be used in a long period. Let's take an example. We create a cgroup foo and run tasks in it. Some of these tasks enter into memstall and state_start is set. Then we move all of these tasks out of cgroup foo for more than (1 << 32) nsecs, and then move them in. That will cause the same issue as above. The root cause of these issues is that we don't initialize the state_start properly. To fix it, we should record how many tasks in this per cpu psi_group. If there's no task in it, we just set state_start and don't calculate the delta, that means it is the begin of the pressure. To avoid redundant calculating the total number of tasks in this per cpu psi_group, a new member 'total_tasks' is introduced in struct psi_group_cpu, which is the sum of array members in tasks[]. Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Cc: Johannes Weiner <hannes@xxxxxxxxxxx> Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> Cc: Daniel Drake <drake@xxxxxxxxxxxx> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> --- include/linux/psi_types.h | 2 ++ kernel/sched/psi.c | 13 ++++++++++--- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 4b7258495a04..b42cbfdb15e9 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -69,6 +69,8 @@ struct psi_group_cpu { /* States of the tasks belonging to this group */ unsigned int tasks[NR_PSI_TASK_COUNTS]; + /* Sum of above array members */ + unsigned int total_tasks; /* Aggregate pressure state derived from the tasks */ u32 state_mask; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8f45cdb6463b..7061529dc406 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -690,7 +690,10 @@ static void psi_group_change(struct psi_group *group, int cpu, */ write_seqcount_begin(&groupc->seq); - record_times(groupc, cpu, false); + if (groupc->total_tasks) + record_times(groupc, cpu, false); + else + groupc->state_start = cpu_clock(cpu); for (t = 0, m = clear; m; m &= ~(1 << t), t++) { if (!(m & (1 << t))) @@ -703,11 +706,15 @@ static void psi_group_change(struct psi_group *group, int cpu, psi_bug = 1; } groupc->tasks[t]--; + groupc->total_tasks--; } - for (t = 0; set; set &= ~(1 << t), t++) - if (set & (1 << t)) + for (t = 0; set; set &= ~(1 << t), t++) { + if (set & (1 << t)) { groupc->tasks[t]++; + groupc->total_tasks++; + } + } /* Calculate state mask representing active states */ for (s = 0; s < NR_PSI_STATES; s++) { -- 2.18.2