Re: [PATCH bpf-next] bpf: use raw_spin_trylock() for pcpu_freelist_push/pop in NMI

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/26/20 2:07 AM, Song Liu wrote:
Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
pcpu_freelist in NMI:

./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi

[   18.984807] ================================
[   18.984807] WARNING: inconsistent lock state
[   18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
[   18.984809] --------------------------------
[   18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
[   18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at:
__pcpu_freelist_pop+0xe3/0x180
[   18.984813] {INITIAL USE} state was registered at:
[   18.984814]   lock_acquire+0x175/0x7c0
[   18.984814]   _raw_spin_lock+0x2c/0x40
[   18.984815]   __pcpu_freelist_pop+0xe3/0x180
[   18.984815]   pcpu_freelist_pop+0x31/0x40
[   18.984816]   htab_map_alloc+0xbbf/0xf40
[   18.984816]   __do_sys_bpf+0x5aa/0x3ed0
[   18.984817]   do_syscall_64+0x2d/0x40
[   18.984818]   entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   18.984818] irq event stamp: 12
[ ... ]
[   18.984822] other info that might help us debug this:
[   18.984823]  Possible unsafe locking scenario:
[   18.984823]
[   18.984824]        CPU0
[   18.984824]        ----
[   18.984824]   lock(&head->lock);
[   18.984826]   <Interrupt>
[   18.984826]     lock(&head->lock);
[   18.984827]
[   18.984828]  *** DEADLOCK ***
[   18.984828]
[   18.984829] 2 locks held by test_progs/1990:
[ ... ]
[   18.984838]  <NMI>
[   18.984838]  dump_stack+0x9a/0xd0
[   18.984839]  lock_acquire+0x5c9/0x7c0
[   18.984839]  ? lock_release+0x6f0/0x6f0
[   18.984840]  ? __pcpu_freelist_pop+0xe3/0x180
[   18.984840]  _raw_spin_lock+0x2c/0x40
[   18.984841]  ? __pcpu_freelist_pop+0xe3/0x180
[   18.984841]  __pcpu_freelist_pop+0xe3/0x180
[   18.984842]  pcpu_freelist_pop+0x17/0x40
[   18.984842]  ? lock_release+0x6f0/0x6f0
[   18.984843]  __bpf_get_stackid+0x534/0xaf0
[   18.984843]  bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
[   18.984844]  bpf_overflow_handler+0x12f/0x3f0

This is because pcpu_freelist_head.lock is accessed in both NMI and
non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.

For systems with only one cpu, there is a trickier scenario with
pcpu_freelist_push(): if the only pcpu_freelist_head.lock is already
locked before NMI, raw_spin_trylock() will never succeed. Unlike,
_pop(), where we can failover and return NULL, failing _push() will leak
memory. Fix this issue with an extra list, pcpu_freelist.extralist. The
extralist is primarily used to take _push() when raw_spin_trylock()
failed on all the per cpu lists. It should be empty most of the time.
The following table summarizes the behavior of pcpu_freelist in NMI
and non-NMI:

non-NMI pop(): 	use _lock(); check per cpu lists first;
                 if all per cpu lists are empty, check extralist;
                 if extralist is empty, return NULL.

non-NMI push(): use _lock(); only push to per cpu lists.

NMI pop():    use _trylock(); check per cpu lists first;
               if all per cpu lists are locked or empty, check extralist;
               if extralist is locked or empty, return NULL.

NMI push():   use _trylock(); check per cpu lists first;
               if all per cpu lists are locked; try push to extralist;
               if extralist is also locked, keep trying on per cpu lists.

Code looks reasonable to me, is there any practical benefit to keep the
extra list around for >1 CPU case (and not just compile it out)? For
example, we could choose a different back end *_freelist_push/pop()
implementation depending on CONFIG_SMP like ...

ifeq ($(CONFIG_SMP),y)
obj-$(CONFIG_BPF_SYSCALL) += percpu_freelist.o
else
obj-$(CONFIG_BPF_SYSCALL) += onecpu_freelist.o
endif

... and keep the CONFIG_SMP simplified in that we'd only do the trylock
iteration over CPUs under NMI with pop aborting with NULL in worst case
and push keep looping, whereas for the single CPU case, all the logic
resides in onecpu_freelist.c and it has a simpler two list implementation?

Thanks,
Daniel



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux