On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
> The required protection is that the caller cannot be migrated to a
> different CPU as these places take either a hash bucket lock or might
> trigger a kprobe inside the memory allocator. Both scenarios can lead to
> deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
> variable which temporarily blocks the invocation of BPF programs from perf
> and kprobes.
>
> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
> kernel this maps to preempt_disable/enable(), i.e. no functional change.

Will that _really_ work on RT ?

I'm puzzled about what will happen in the following scenario on RT:
Thread A is preempted within e.g. htab_elem_free_rcu(), and Thread B is
scheduled and runs through a bunch of tracepoints. Both are on the same
CPU's runqueue:

CPU 1

Thread A is scheduled
(Thread A) htab_elem_free_rcu()
(Thread A)   migrate_disable()
(Thread A)   __this_cpu_inc(bpf_prog_active);
             -> per-cpu variable for deadlock prevention.
Thread A is preempted
Thread B is scheduled
(Thread B) Runs through various tracepoints:
             trace_call_bpf()
               if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
               -> will skip any instrumentation that happens to be on
                  this CPU until...
Thread B is preempted
Thread A is scheduled
(Thread A)   __this_cpu_dec(bpf_prog_active);
(Thread A)   migrate_enable()

Having all those events randomly and silently discarded might be quite
unexpected from a user standpoint. This turns the deadlock prevention
mechanism into a facility that randomly drops tracepoints, which is
unsettling.

One alternative approach we could consider to solve this is to make this
deadlock-prevention nesting counter per-thread rather than per-cpu (rough
sketch below).

Also, I don't think using __this_cpu_inc() without preempt-disable or
irqs off is safe. You'll probably want to move to this_cpu_inc()/dec()
instead, which can be heavier on some architectures.
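To make that alternative concrete, here is a rough, untested sketch of
what a per-thread nesting counter could look like. The task_struct field
is hypothetical and the snippets are illustrative only, not a proper
patch:

  /* include/linux/sched.h: hypothetical per-task recursion guard. */
  struct task_struct {
          ...
          int     bpf_prog_active;        /* nesting counter owned by the task */
          ...
  };

  /* htab_elem_free_rcu() sketch: the guard moves to current. */
          migrate_disable();
          current->bpf_prog_active++;
          htab_elem_free(htab, l);
          current->bpf_prog_active--;
          migrate_enable();

  /* trace_call_bpf() sketch: only genuine same-task recursion is
   * skipped; a preempted task no longer blocks other tasks' events.
   */
          if (unlikely(++current->bpf_prog_active != 1))
                  goto out;

Because the counter is task-local, the increment needs neither
preempt-disable nor the this_cpu_*() operations. A kprobe firing in
interrupt context still observes the same current, so same-task nesting
through interrupts should still be caught, modulo compiler-ordering
details a real patch would need to audit.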
Thanks,

Mathieu

>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
>  kernel/bpf/hashtab.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
>  	 * we're calling kfree, otherwise deadlock is possible if kprobes
>  	 * are placed somewhere inside of slub
>  	 */
> -	preempt_disable();
> +	migrate_disable();
>  	__this_cpu_inc(bpf_prog_active);
>  	htab_elem_free(htab, l);
>  	__this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  }
>
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> @@ -1327,7 +1327,7 @@ static int
>  	}
>
>  again:
> -	preempt_disable();
> +	migrate_disable();
>  	this_cpu_inc(bpf_prog_active);
>  	rcu_read_lock();
>  again_nocopy:
> @@ -1347,7 +1347,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		goto after_loop;
>  	}
>
> @@ -1356,7 +1356,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		kvfree(keys);
>  		kvfree(values);
>  		goto alloc;
> @@ -1406,7 +1406,7 @@ static int
>
>  	rcu_read_unlock();
>  	this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
>  				       key_size * bucket_cnt) ||
>  		   copy_to_user(uvalues + total * value_size, values,

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com