On 1/2/20 1:24 PM, Darrick J. Wong wrote:
> On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote:
>>
>>> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@xxxxxxxxxx> wrote:
>>>
>>> Depending on the workloads, the following circular locking dependency
>>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>>> lock) may show up:
>>>
>>> ======================================================
>>> WARNING: possible circular locking dependency detected
>>> 5.0.0-rc1+ #60 Tainted: G        W
>>> ------------------------------------------------------
>>> fsfreeze/4346 is trying to acquire lock:
>>> 0000000026f1d784 (fs_reclaim){+.+.}, at:
>>> fs_reclaim_acquire.part.19+0x5/0x30
>>>
>>> but task is already holding lock:
>>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>>
>>> which lock already depends on the new lock.
>>>  :
>>> Possible unsafe locking scenario:
>>>
>>>        CPU0                    CPU1
>>>        ----                    ----
>>>   lock(sb_internal);
>>>                                lock(fs_reclaim);
>>>                                lock(sb_internal);
>>>   lock(fs_reclaim);
>>>
>>>  *** DEADLOCK ***
>>>
>>> 4 locks held by fsfreeze/4346:
>>> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>>> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>>> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>>> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>>
>>> stack backtrace:
>>> Call Trace:
>>>  dump_stack+0xe0/0x19a
>>>  print_circular_bug.isra.10.cold.34+0x2f4/0x435
>>>  check_prev_add.constprop.19+0xca1/0x15f0
>>>  validate_chain.isra.14+0x11af/0x3b50
>>>  __lock_acquire+0x728/0x1200
>>>  lock_acquire+0x269/0x5a0
>>>  fs_reclaim_acquire.part.19+0x29/0x30
>>>  fs_reclaim_acquire+0x19/0x20
>>>  kmem_cache_alloc+0x3e/0x3f0
>>>  kmem_zone_alloc+0x79/0x150
>>>  xfs_trans_alloc+0xfa/0x9d0
>>>  xfs_sync_sb+0x86/0x170
>>>  xfs_log_sbcount+0x10f/0x140
>>>  xfs_quiesce_attr+0x134/0x270
>>>  xfs_fs_freeze+0x4a/0x70
>>>  freeze_super+0x1af/0x290
>>>  do_vfs_ioctl+0xedc/0x16c0
>>>  ksys_ioctl+0x41/0x80
>>>  __x64_sys_ioctl+0x73/0xa9
>>>  do_syscall_64+0x18f/0xd23
>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> According to Dave Chinner:
>>>
>>>   Freezing the filesystem, after all the data has been cleaned. IOWs
>>>   memory reclaim will never run the above writeback path when the
>>>   freeze process is trying to allocate a transaction here because
>>>   there are no dirty data pages in the filesystem at this point.
>>>
>>>   Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that
>>>   it /doesn't deadlock/ by taking freeze references for the
>>>   transaction. We've just drained all the transactions in progress
>>>   and written back all the dirty metadata, too, and so the filesystem
>>>   is completely clean and only needs the superblock to be updated to
>>>   complete the freeze process. And to do that, it does not take a
>>>   freeze reference because calling sb_start_intwrite() here would
>>>   deadlock.
>>>
>>>   IOWs, this is a false positive, caused by the fact that
>>>   xfs_trans_alloc() is called from both above and below memory
>>>   reclaim as well as within /every level/ of freeze processing.
>>>   Lockdep is unable to describe the staged flush logic in the freeze
>>>   process that prevents deadlocks from occurring, and hence we will
>>>   pretty much always see false positives in the freeze path....
>>>
>>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem
>>> lock may fix the issue.
>>> However, that will greatly complicate the logic and may not be worth
>>> it.
>>>
>>> Another way to fix it is to disable the taking of the fs_reclaim
>>> pseudo lock in the freezing code path, as reclaim on the frozen
>>> filesystem is not possible, as stated above. This patch takes that
>>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
>>> allocation calls when the filesystem has been frozen.
>>>
>>> Without this patch, the command sequence below will show that the
>>> lock dependency chain sb_internal -> fs_reclaim exists:
>>>
>>> # fsfreeze -f /home
>>> # fsfreeze --unfreeze /home
>>> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
>>>
>>> After applying the patch, the sb_internal -> fs_reclaim lock
>>> dependency chain can no longer be found, and so the locking
>>> dependency warning is no longer shown.
>>
>> There was an attempt to fix this in the past, but Dave rejected any
>> workaround in XFS right away and insisted on making lockdep smarter
>> instead. Not sure your approach will make any difference this time.
>> Good luck.
>
> /me wonders if you can fix this by having the freeze path call
> memalloc_nofs_save() since we probably don't want to be recursing into
> the fs for reclaim while freezing it?  Probably not, because that's a
> bigger hammer than we really need here.  We can certainly steal memory
> from other filesystems that aren't frozen.
>
> It doesn't solve the key issue that lockdep isn't smart enough to know
> that we can't recurse into the fs that's being frozen and therefore
> there's no chance of deadlock.

Lockdep just looks at all the possible locking chains to see whether a
circular deadlock is possible. It doesn't have the smarts to understand
filesystem internals.

The problem here is caused by the fact that fs_reclaim is a global
pseudo lock that is acquired whenever there is a chance that FS reclaim
can happen. As I said in the commit log, it may be possible to fix that
by breaking fs_reclaim up into a set of per-filesystem pseudo locks,
but that would add quite a bit of complexity to the code. That is why I
don't want to go that route.

This patch is the least invasive fix I can think of that addresses the
problem without inhibiting other valid lockdep checks.

Cheers,
Longman
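
P.S. To make the idea concrete, here is a minimal sketch of the
approach. The xfs_freeze_gfp() helper name is made up for illustration
and the exact freeze-state check may differ from the actual patch;
__GFP_NOLOCKDEP and sb->s_writers.frozen are the real kernel pieces
involved.

#include <linux/fs.h>		/* struct super_block, SB_FREEZE_FS */
#include <linux/gfp.h>		/* gfp_t, __GFP_NOLOCKDEP */

/*
 * Add __GFP_NOLOCKDEP once the filesystem has reached SB_FREEZE_FS.
 * freeze_super() raises the freeze state to SB_FREEZE_FS before calling
 * ->freeze_fs(), and at that point all dirty data and metadata have
 * already been flushed, so reclaim cannot recurse into this filesystem.
 * The flag tells lockdep to skip fs_reclaim tracking for the
 * allocation; it is a no-op when CONFIG_LOCKDEP is not set.
 */
static inline gfp_t xfs_freeze_gfp(struct super_block *sb, gfp_t flags)
{
	if (sb->s_writers.frozen >= SB_FREEZE_FS)
		flags |= __GFP_NOLOCKDEP;
	return flags;
}

The transaction allocation done under xfs_fs_freeze() would then pass
xfs_freeze_gfp(mp->m_super, GFP_KERNEL) instead of plain GFP_KERNEL.
Since fs_reclaim_acquire() ignores allocations marked __GFP_NOLOCKDEP,
the false sb_internal -> fs_reclaim dependency edge is never recorded,
while all other allocation paths keep full lockdep coverage.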