Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> writes:

> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> commit 8974189222159154c55f24ddad33e3613960521a upstream.
>
> As per commit:
>
>   b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
>
>> the code generated from update_cfs_rq_load_avg():
>>
>> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
>> 		removed_load = 1;
>> 	}
>>
>> turns into:
>>
>> ffffffff81087064:	49 8b 85 98 00 00 00	mov    0x98(%r13),%rax
>> ffffffff8108706b:	48 85 c0		test   %rax,%rax
>> ffffffff8108706e:	74 40			je     ffffffff810870b0 <update_blocked_averages+0xc0>
>> ffffffff81087070:	4c 89 f8		mov    %r15,%rax
>> ffffffff81087073:	49 87 85 98 00 00 00	xchg   %rax,0x98(%r13)
>> ffffffff8108707a:	49 29 45 70		sub    %rax,0x70(%r13)
>> ffffffff8108707e:	4c 89 f9		mov    %r15,%rcx
>> ffffffff81087081:	bb 01 00 00 00		mov    $0x1,%ebx
>> ffffffff81087086:	49 83 7d 70 00		cmpq   $0x0,0x70(%r13)
>> ffffffff8108708b:	49 0f 49 4d 70		cmovns 0x70(%r13),%rcx
>>
>> Which you'll note ends up with sa->load_avg -= r in memory at
>> ffffffff8108707a.
>
> So I _should_ have looked at other unserialized users of ->load_avg,
> but alas. Luckily nikbor reported a similar /0 from task_h_load() which
> instantly triggered recollection of this here problem.
>
> Aside from the intermediate value hitting memory and causing problems,
> there's another problem: the underflow detection relies on the signed
> bit. This reduces the effective width of the variables, IOW its
> effectively the same as having these variables be of signed type.
>
> This patch changes to a different means of unsigned underflow
> detection to not rely on the signed bit. This allows the variables to
> use the 'full' unsigned range. And it does so with explicit LOAD -
> STORE to ensure any intermediate value will never be visible in
> memory, allowing these unserialized loads.
>
> Note: GCC generates crap code for this, might warrant a look later.
>
> Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
> maybe we should do clamping on add too.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Cc: Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Yuyang Du <yuyang.du@xxxxxxxxx>
> Cc: bsegall@xxxxxxxxxx
> Cc: kernel@xxxxxxxx
> Cc: morten.rasmussen@xxxxxxx
> Cc: pjt@xxxxxxxxxx
> Cc: steve.muckle@xxxxxxxxxx
> Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
> Link: http://lkml.kernel.org/r/20160617091948.GJ30927@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
>
> ---
>  kernel/sched/fair.c |   33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2682,6 +2682,23 @@ static inline void update_tg_load_avg(st
>
>  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> +/*
> + * Unsigned subtract and clamp on underflow.
> + *
> + * Explicitly do a load-store to ensure the intermediate value never hits
> + * memory. This allows lockless observations without ever seeing the negative
> + * values.
> + */
> +#define sub_positive(_ptr, _val) do {				\
> +	typeof(_ptr) ptr = (_ptr);				\
> +	typeof(*ptr) val = (_val);				\
> +	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
> +	res = var - val;					\
> +	if (res > var)						\
> +		res = 0;					\
> +	WRITE_ONCE(*ptr, res);					\
> +} while (0)
> +
>  /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> @@ -2690,15 +2707,15 @@ static inline int update_cfs_rq_load_avg
>
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> -		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> -		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +		sub_positive(&sa->load_avg, r);
> +		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
>  		removed = 1;
>  	}
>
>  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
>  		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
> -		sa->util_avg = max_t(long, sa->util_avg - r, 0);
> -		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
> +		sub_positive(&sa->util_avg, r);
> +		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
>  	}
>
>  	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> @@ -2764,10 +2781,10 @@ static void detach_entity_load_avg(struc
>  			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
>  			  cfs_rq->curr == se, NULL);
>
> -	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> -	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> -	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> -	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> +	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
> +	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
> +	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> +	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>  }
>
>  /* Add the load generated by se into cfs_rq's load average */

I missed this the first time around, and I have no problem with this
backport, but it's not remotely obvious that update_cfs_rq_h_load() is
intended to be something approaching racing-safe when not under
rq->lock. (And given only somewhat-adversarial compilers I agree that
it probably won't do any worse than skip updates, though I certainly
won't swear to it)
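
For anyone who wants to see the clamping behaviour in isolation, here is
a minimal userspace sketch of the same idea. It is not the kernel code:
READ_ONCE()/WRITE_ONCE() are approximated with plain volatile accesses
(modelling only the single-load/single-store property), and the last
stanza merely illustrates why the old max_t(long, ...) style narrows the
usable range; the unsigned-to-signed conversion there is
implementation-defined, but on the usual two's-complement targets a value
above LONG_MAX looks negative and gets clamped to 0 with no underflow
having happened.

	#include <stdio.h>
	#include <limits.h>

	/*
	 * Userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE();
	 * they model only the single-load/single-store property.
	 */
	#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
	#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

	/* Same shape as the patch's sub_positive(): subtract, clamp underflow to 0. */
	#define sub_positive(_ptr, _val) do {				\
		__typeof__(_ptr) ptr = (_ptr);				\
		__typeof__(*ptr) val = (_val);				\
		__typeof__(*ptr) res, var = READ_ONCE(*ptr);		\
		res = var - val;					\
		/* unsigned wrap-around: var - val > var iff val > var */ \
		if (res > var)						\
			res = 0;					\
		WRITE_ONCE(*ptr, res);					\
	} while (0)

	int main(void)
	{
		unsigned long avg = 10;

		sub_positive(&avg, 3UL);
		printf("%lu\n", avg);		/* 7 */

		sub_positive(&avg, 100UL);
		printf("%lu\n", avg);		/* 0: underflow clamped */

		/*
		 * Old signed-bit detection: a value above LONG_MAX already
		 * looks negative after the cast, so it is clamped to 0 even
		 * though no underflow occurred.
		 */
		unsigned long big = (unsigned long)LONG_MAX + 2;
		long old_style = (long)(big - 1) > 0 ? (long)(big - 1) : 0;
		printf("%ld\n", old_style);	/* 0 on typical targets */

		return 0;
	}

The res > var test works because, in unsigned arithmetic, var - val wraps
around (and therefore exceeds var) exactly when val > var.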
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html