On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
> The required protection is that the caller cannot be migrated to a
> different CPU as these places take either a hash bucket lock or might
> trigger a kprobe inside the memory allocator. Both scenarios can lead to
> deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
> variable which temporarily blocks the invocation of BPF programs from perf
> and kprobes.
>
> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
> kernel this maps to preempt_disable/enable(), i.e. no functional change.

Will that _really_ work on RT ?

I'm puzzled about what will happen in the following scenario on RT:
Thread A is preempted within e.g. htab_elem_free_rcu(), and Thread B is
scheduled and runs through a bunch of tracepoints. Both are on the same
CPU's runqueue:

CPU 1

Thread A is scheduled
(Thread A) htab_elem_free_rcu()
(Thread A)   migrate_disable()
(Thread A)   __this_cpu_inc(bpf_prog_active);
             -> per-cpu variable for deadlock prevention.
Thread A is preempted
Thread B is scheduled
(Thread B) Runs through various tracepoints:
             trace_call_bpf()
               if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
               -> will skip any instrumentation that happens to be on
                  this CPU until...
Thread B is preempted
Thread A is scheduled
(Thread A)   __this_cpu_dec(bpf_prog_active);
(Thread A)   migrate_enable()

Having all those events randomly and silently discarded might be quite
unexpected from a user standpoint. This turns the deadlock prevention
mechanism into a facility that randomly drops tracepoints, which is
unsettling.

One alternative approach we could consider to solve this is to make this
deadlock-prevention nesting counter per-thread rather than per-cpu (rough
sketch below).

Also, I don't think using __this_cpu_inc() without preempt-disable or
irqs off is safe. You'll probably want to move to this_cpu_inc()/dec()
instead, which can be heavier on some architectures.
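To make that alternative concrete, here is a rough, untested sketch of
what a per-thread nesting counter could look like. The task_struct field
is hypothetical and the snippets are illustrative only, not a proper
patch:

  /* include/linux/sched.h: hypothetical per-task recursion guard. */
  struct task_struct {
          ...
          int     bpf_prog_active;        /* nesting counter owned by the task */
          ...
  };

  /* htab_elem_free_rcu() sketch: the guard moves to current. */
          migrate_disable();
          current->bpf_prog_active++;
          htab_elem_free(htab, l);
          current->bpf_prog_active--;
          migrate_enable();

  /* trace_call_bpf() sketch: only genuine same-task recursion is
   * skipped; a preempted task no longer blocks other tasks' events.
   */
          if (unlikely(++current->bpf_prog_active != 1))
                  goto out;

Because the counter is task-local, the increment needs neither
preempt-disable nor the this_cpu_*() operations. A kprobe firing in
interrupt context still observes the same current, so same-task nesting
through interrupts should still be caught, modulo compiler-ordering
details a real patch would need to audit.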
Thanks,

Mathieu

>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
>  kernel/bpf/hashtab.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
>  	 * we're calling kfree, otherwise deadlock is possible if kprobes
>  	 * are placed somewhere inside of slub
>  	 */
> -	preempt_disable();
> +	migrate_disable();
>  	__this_cpu_inc(bpf_prog_active);
>  	htab_elem_free(htab, l);
>  	__this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  }
>
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> @@ -1327,7 +1327,7 @@ static int
>  	}
>
>  again:
> -	preempt_disable();
> +	migrate_disable();
>  	this_cpu_inc(bpf_prog_active);
>  	rcu_read_lock();
>  again_nocopy:
> @@ -1347,7 +1347,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		goto after_loop;
>  	}
>
> @@ -1356,7 +1356,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		kvfree(keys);
>  		kvfree(values);
>  		goto alloc;
> @@ -1406,7 +1406,7 @@ static int
>
>  	rcu_read_unlock();
>  	this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
>  				       key_size * bucket_cnt) ||
>  		   copy_to_user(uvalues + total * value_size, values,

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com