From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx> Commit: 57430218317e ("sched/cputime: Count actually elapsed irq & softirq time") ... triggered a regression: | An i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four | cpu hog processes(for loop) running in the guest, I hot-unplug the pCPUs | on host one by one until there is only one left, then observe the top in | guest, there are 100% st for cpu0(housekeeping), and 75% st for other cpus | (nohz full mode). However, w/o this commit, 75% for all the four cpus. When a guest is interrupted for a longer amount of time, missed clock ticks are not redelivered later. Because of that, we should not limit the amount of steal time accounted to the amount of time that the calling functions think have passed. However, the interval returned by account_other_time() is NOT rounded down to the nearest jiffy, while the base interval in get_vtime_delta() it is subtracted from is, so the max cputime limit is required to avoid underflow. This patch fix the regression by limiting the account_other_time() from get_vtime_delta() to avoid underflow, and let other three call sites (account_other_time() and steal_account_process_time()) account however much steal time the host told us elapsed. Suggested-by: Rik van Riel <riel@xxxxxxxxxx> Suggested-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> Cc: Ingo Molnar <mingo@xxxxxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx> Cc: Radim Krcmar <rkrcmar@xxxxxxxxxx> Cc: Mike Galbraith <efault@xxxxxx> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx> --- v1 -> v2: * add code comments and update the changelog kernel/sched/cputime.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 9858266..e52cf7f 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime) cpustat[CPUTIME_IDLE] += (__force u64) cputime; } +/* + * When a guest is interrupted for a longer amount of time, missed clock + * ticks are not redelivered later. Due to that, this function may on + * occasion account more time than the calling functions think elapsed. + */ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime) { #ifdef CONFIG_PARAVIRT @@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, * idle, or potentially user or system time. Due to rounding, * other time can exceed ticks occasionally. */ - other = account_other_time(cputime); + other = account_other_time(ULONG_MAX); if (other >= cputime) return; cputime -= other; @@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick) } cputime = cputime_one_jiffy; - steal = steal_account_process_time(cputime); + steal = steal_account_process_time(ULONG_MAX); if (steal >= cputime) return; @@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks) } cputime = jiffies_to_cputime(ticks); - steal = steal_account_process_time(cputime); + steal = steal_account_process_time(ULONG_MAX); if (steal >= cputime) return; @@ -694,6 +699,12 @@ static cputime_t get_vtime_delta(struct task_struct *tsk) unsigned long now = READ_ONCE(jiffies); cputime_t delta, other; + /* + * The interval returned by account_other_time() is NOT + * rounded down to the nearest jiffy, while the base + * interval it is subtracted from is. So the max cputime + * limit is required to avoid underflow. + */ delta = jiffies_to_cputime(now - tsk->vtime_snap); other = account_other_time(delta); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html