On 2/21/25 03:46, Dennis Zhou wrote:
> Hello,
>
> On Thu, Feb 20, 2025 at 03:37:26PM -0500, Kent Overstreet wrote:
>> On Thu, Feb 20, 2025 at 06:16:43PM +0100, Vlastimil Babka wrote:
>> > On 2/20/25 11:57, Alan Huang wrote:
>> > > Ping
>> > >
>> > >> On Feb 12, 2025, at 22:27, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>> > >>
>> > >> Adding pcpu people to the CC
>> > >>
>> > >> On Wed, Feb 12, 2025 at 06:06:25PM +0800, Alan Huang wrote:
>> > >>> The cycle:
>> > >>>
>> > >>> CPU0:               CPU1:
>> > >>> bc->lock            pcpu_alloc_mutex
>> > >>> pcpu_alloc_mutex    bc->lock
>> > >>>
>> > >>> Reported-by: syzbot+fe63f377148a6371a9db@xxxxxxxxxxxxxxxxxxxxxxxxx
>> > >>> Tested-by: syzbot+fe63f377148a6371a9db@xxxxxxxxxxxxxxxxxxxxxxxxx
>> > >>> Signed-off-by: Alan Huang <mmpgouride@xxxxxxxxx>
>> > >>
>> > >> So pcpu_alloc_mutex -> fs_reclaim?
>> > >>
>> > >> That's really awkward; seems like something that might invite more
>> > >> issues. We can apply your fix if we need to, but I want to hear what the
>> > >> percpu people have to say first.
>> > >>
>> > >> ======================================================
>> > >> WARNING: possible circular locking dependency detected
>> > >> 6.14.0-rc2-syzkaller-00039-g09fbf3d50205 #0 Not tainted
>> > >> ------------------------------------------------------
>> > >> syz.0.21/5625 is trying to acquire lock:
>> > >> ffffffff8ea19608 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x293/0x1760 mm/percpu.c:1782
>> > >>
>> > >> but task is already holding lock:
>> > >> ffff888051401c68 (&bc->lock){+.+.}-{4:4}, at: bch2_btree_node_mem_alloc+0x559/0x16f0 fs/bcachefs/btree_cache.c:804
>> > >>
>> > >> which lock already depends on the new lock.
>> > >>
>> > >>
>> > >> the existing dependency chain (in reverse order) is:
>> > >>
>> > >> -> #2 (&bc->lock){+.+.}-{4:4}:
>> > >>        lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5851
>> > >>        __mutex_lock_common kernel/locking/mutex.c:585 [inline]
>> > >>        __mutex_lock+0x19c/0x1010 kernel/locking/mutex.c:730
>> > >>        bch2_btree_cache_scan+0x184/0xec0 fs/bcachefs/btree_cache.c:482
>> > >>        do_shrink_slab+0x72d/0x1160 mm/shrinker.c:437
>> > >>        shrink_slab+0x1093/0x14d0 mm/shrinker.c:664
>> > >>        shrink_one+0x43b/0x850 mm/vmscan.c:4868
>> > >>        shrink_many mm/vmscan.c:4929 [inline]
>> > >>        lru_gen_shrink_node mm/vmscan.c:5007 [inline]
>> > >>        shrink_node+0x37c5/0x3e50 mm/vmscan.c:5978
>> > >>        kswapd_shrink_node mm/vmscan.c:6807 [inline]
>> > >>        balance_pgdat mm/vmscan.c:6999 [inline]
>> > >>        kswapd+0x20f3/0x3b10 mm/vmscan.c:7264
>> > >>        kthread+0x7a9/0x920 kernel/kthread.c:464
>> > >>        ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:148
>> > >>        ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
>> > >>
>> > >> -> #1 (fs_reclaim){+.+.}-{0:0}:
>> > >>        lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5851
>> > >>        __fs_reclaim_acquire mm/page_alloc.c:3853 [inline]
>> > >>        fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867
>> > >>        might_alloc include/linux/sched/mm.h:318 [inline]
>> > >>        slab_pre_alloc_hook mm/slub.c:4066 [inline]
>> > >>        slab_alloc_node mm/slub.c:4144 [inline]
>> > >>        __do_kmalloc_node mm/slub.c:4293 [inline]
>> > >>        __kmalloc_noprof+0xae/0x4c0 mm/slub.c:4306
>> > >>        kmalloc_noprof include/linux/slab.h:905 [inline]
>> > >>        kzalloc_noprof include/linux/slab.h:1037 [inline]
>> > >>        pcpu_mem_zalloc mm/percpu.c:510 [inline]
>> > >>        pcpu_alloc_chunk mm/percpu.c:1430 [inline]
>> > >>        pcpu_create_chunk+0x57/0xbc0 mm/percpu-vm.c:338
>> > >>        pcpu_balance_populated mm/percpu.c:2063 [inline]
>> > >>        pcpu_balance_workfn+0xc4d/0xd40 mm/percpu.c:2200
>> > >>        process_one_work kernel/workqueue.c:3236 [inline]
>> > >>        process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
>> > >>        worker_thread+0x870/0xd30 kernel/workqueue.c:3398
>> > >>        kthread+0x7a9/0x920 kernel/kthread.c:464
>> > >>        ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:148
>> > >>        ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
>> >
>> > Seeing this as part of the chain (fs reclaim from a worker doing
>> > pcpu_balance_workfn) makes me think Michal's patch could be a fix to this:
>> >
>> > https://lore.kernel.org/all/20250206122633.167896-1-mhocko@xxxxxxxxxx/
>>
>> Thanks for the link - that does look like just the thing.
>
> Sorry I missed the first email asking to weigh in.
>
> Michal's problem is a little bit different than what's happening here.

Yes, but it's related enough. He mentions commit 28307d938fb2, and there you
find a similar kind of lockdep splat.

> He's having an issue where an alloc_percpu_gfp(NOFS/NOIO) is considered
> atomic and failing during probing. This is because we don't have enough
> percpu memory backed to fulfill the "atomic" requests.

That, and we don't allow NOFS/NOIO to take the pcpu_alloc_mutex, to avoid a
deadlock with pcpu_balance_workfn taking it and then doing __GFP_FS reclaim.

> Historically we've considered any allocation that's not GFP_KERNEL to be
> atomic. Here it seems like the alloc_percpu() behind the bc->lock()
> should have been an "atomic" allocation to prevent the lock cycle?

That's what the original mail/patch in this thread suggested:

https://lore.kernel.org/all/20250212100625.55860-1-mmpgouride@xxxxxxxxx/

Note it proposes GFP_NOWAIT; possibly GFP_NOFS would be enough, but then the
current implementation would make it "atomic" anyway. But then it could end
up failing like the allocations that motivated Michal's patch?

So with Michal's approach we can avoid having to weaken pcpu_alloc() callers
like this. Yes, it's counter-intuitive that we instead weaken a kworker
context, which normally has no restrictions. But it's not weakened (to NOIO)
nearly as much as pcpu_alloc() users are, which become effectively atomic
when they can't take pcpu_alloc_mutex at all. Rough sketches of both
directions below, to make the comparison concrete.

> Thanks,
> Dennis
>
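First the caller-side direction Alan's patch takes. This is only a minimal
sketch of the idea, not the actual patch (see the lore link above); the
struct, field and function names here are hypothetical stand-ins:

	#include <linux/percpu.h>
	#include <linux/gfp.h>

	/* Hypothetical stand-in type, for illustration only. */
	struct node_counters {
		u64	sectors;
	};

	static int node_alloc_counters(struct node_counters __percpu **p)
	{
		/*
		 * GFP_NOWAIT makes pcpu_alloc() treat the request as
		 * atomic, so it never takes pcpu_alloc_mutex and the
		 * bc->lock -> pcpu_alloc_mutex edge in the splat above
		 * goes away. The cost: the allocation can now fail
		 * whenever the pre-populated percpu area is exhausted,
		 * which is exactly what bit the callers that motivated
		 * Michal's patch.
		 */
		*p = alloc_percpu_gfp(struct node_counters, GFP_NOWAIT);

		return *p ? 0 : -ENOMEM;
	}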
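And the kworker-side direction, roughly what Michal's patch does as I read
it (again a sketch only, not the literal diff against mm/percpu.c):

	#include <linux/sched/mm.h>
	#include <linux/workqueue.h>

	static void pcpu_balance_workfn(struct work_struct *work)
	{
		/*
		 * Run the balance work under a NOIO scope: any reclaim
		 * its allocations trigger can no longer recurse into
		 * FS/IO paths, i.e. into shrinkers like
		 * bch2_btree_cache_scan() that take bc->lock. With that
		 * edge gone, NOFS/NOIO pcpu_alloc() callers can be
		 * allowed to take pcpu_alloc_mutex and sleep instead of
		 * being treated as atomic.
		 */
		unsigned int noio_flags = memalloc_noio_save();

		/* ... the existing chunk depopulation/population work ... */

		memalloc_noio_restore(noio_flags);
	}

If I read it right, that keeps NOFS/NOIO percpu allocations able to sleep
and succeed, rather than silently turning them into NOWAIT and having them
fail like in Michal's probing case.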