Re: XFS AIL lockup

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 2 Oct 2017 09:49:04 +1100

On Sun, Oct 01, 2017 at 03:10:03PM -0700, Sargun Dhillon wrote:
> I'm running into an issue where xfs aild is locking up. This is on
> kernel version 4.9.34. It's an SMP system with 32 cores, and ~250G of
> RAM (AWS R4.8XL) and an XFS filesystem with 1 SSD with project ID
> quotas in use. It's the only XFS filesystem on the host. The root
> partition is running EXT4, and isn't involved in this.
> 
> There are containers that use overlayfs atop this filesystem. It looks
> like one of the processes (10090, or 11504) has gotten into a state
> where it's holding a lock on a xfs_buf, and they're trying to lock
> xfs_buf's which are currently on the xfs ail list.
> 
> xfs_info:
> (root) ~ # xfs_info /mnt
> meta-data=/dev/xvdb              isize=512    agcount=4, agsize=33554432 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=134217728, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=65536, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The stacks of the locked up processes are as follows:
> (root) ~ # cat /proc/10090/stack
> [<ffffffffad2d0981>] down+0x41/0x50
> [<ffffffffc164051c>] xfs_buf_lock+0x3c/0xf0 [xfs]
> [<ffffffffc1640735>] _xfs_buf_find+0x165/0x340 [xfs]
> [<ffffffffc164093a>] xfs_buf_get_map+0x2a/0x280 [xfs]
> [<ffffffffc16415bd>] xfs_buf_read_map+0x2d/0x180 [xfs]
> [<ffffffffc1675f75>] xfs_trans_read_buf_map+0xf5/0x330 [xfs]
> [<ffffffffc1625659>] xfs_read_agi+0x99/0x130 [xfs]
> [<ffffffffc16530b2>] xfs_iunlink_remove+0x62/0x370 [xfs]
> [<ffffffffc16571dc>] xfs_rename+0x7cc/0xb90 [xfs]
> [<ffffffffc1651096>] xfs_vn_rename+0xd6/0x150 [xfs]
> [<ffffffffad444268>] vfs_rename+0x758/0x980
> [<ffffffffc01a8e17>] ovl_do_rename+0x37/0xa0 [overlay]
> [<ffffffffc01a9e8b>] ovl_rename2+0x65b/0x720 [overlay]
> [<ffffffffad444268>] vfs_rename+0x758/0x980
> [<ffffffffad4487ef>] SyS_rename+0x39f/0x3c0
> [<ffffffffad203b8b>] do_syscall_64+0x5b/0xc0
> [<ffffffffada091ef>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

Ok, this is a RENAME_WHITEOUT case, and that points to the issue.
The whiteout inode is allocated as a temporary inode, which means
it remains on the unlinked list so that if we crash part way through
the update, log recovery will free it again.

Once all the dirent updates and other rename work is done, we remove
the whiteout inode from the unlinked list, and that requires
grabbing the AGI lock. That's what we are stuck on here.

> (root) ~ # cat /proc/1107/stack
> [<ffffffffc1674894>] xfsaild+0xe4/0x730 [xfs]
> [<ffffffffad2a5886>] kthread+0xe6/0x100
> [<ffffffffada093b5>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

The AIL and its behaviour are irrelevant here.

> (root) ~ # cat /proc/11504/stack
> [<ffffffffad2d0981>] down+0x41/0x50
> [<ffffffffc164051c>] xfs_buf_lock+0x3c/0xf0 [xfs]
> [<ffffffffc1640735>] _xfs_buf_find+0x165/0x340 [xfs]
> [<ffffffffc164093a>] xfs_buf_get_map+0x2a/0x280 [xfs]
> [<ffffffffc16415bd>] xfs_buf_read_map+0x2d/0x180 [xfs]
> [<ffffffffc1675f75>] xfs_trans_read_buf_map+0xf5/0x330 [xfs]
> [<ffffffffc15f1a36>] xfs_read_agf+0x96/0x120 [xfs]
> [<ffffffffc15f1b09>] xfs_alloc_read_agf+0x49/0x140 [xfs]
> [<ffffffffc15f1f5d>] xfs_alloc_fix_freelist+0x35d/0x3b0 [xfs]
> [<ffffffffc15f22f4>] xfs_alloc_vextent+0x2e4/0x640 [xfs]
> [<ffffffffc16243a8>] xfs_ialloc_ag_alloc+0x1a8/0x760 [xfs]
> [<ffffffffc1626173>] xfs_dialloc+0x173/0x260 [xfs]
> [<ffffffffc1652951>] xfs_ialloc+0x71/0x580 [xfs]
> [<ffffffffc1654e53>] xfs_dir_ialloc+0x73/0x200 [xfs]
> [<ffffffffc1655459>] xfs_create+0x479/0x720 [xfs]
> [<ffffffffc16524b7>] xfs_generic_create+0x217/0x2f0 [xfs]
> [<ffffffffc16525c4>] xfs_vn_mknod+0x14/0x20 [xfs]
> [<ffffffffc1652603>] xfs_vn_create+0x13/0x20 [xfs]
> [<ffffffffad442727>] vfs_create+0x127/0x190
> [<ffffffffc01a932d>] ovl_create_real+0xad/0x230 [overlay]
> [<ffffffffc01aa539>] ovl_create_or_link.part.5+0x119/0x6f0 [overlay]
> [<ffffffffc01aac0a>] ovl_create_object+0xfa/0x110 [overlay]
> [<ffffffffc01aacd3>] ovl_create+0x23/0x30 [overlay]
> [<ffffffffad445808>] path_openat+0x1378/0x1440
> [<ffffffffad446b91>] do_filp_open+0x91/0x100
> [<ffffffffad433d74>] do_sys_open+0x124/0x210
> [<ffffffffad433e7e>] SyS_open+0x1e/0x20
> [<ffffffffad203b8b>] do_syscall_64+0x5b/0xc0
> [<ffffffffada091ef>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

The AIL is irrelevant because this is the deadlock - here we're trying
to lock the AGF with an AGI already locked. That means the
RENAME_WHITEOUT above has either allocated or freed extents while
manipulating the dirents during rename, and so holds an AGF locked
while it waits for the AGI. It's a classic ABBA deadlock.
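
To make the ordering problem concrete, here is a minimal user-space
sketch (Python, not XFS code) that models the two tasks purely as the
order in which they take locks, and checks the resulting lock-order
graph for a cycle. The task names, the helper functions and the
"consistent" ordering are all hypothetical and only there to show that
the cycle disappears when both tasks agree on an order; none of it is
a proposed fix.

```python
from collections import defaultdict

def lock_order_edges(sequences):
    """Return (held, wanted) pairs for every lock taken while another is held."""
    edges = set()
    for seq in sequences.values():
        for i, wanted in enumerate(seq):
            for held in seq[:i]:
                edges.add((held, wanted))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed lock-order graph."""
    graph = defaultdict(set)
    for held, wanted in edges:
        graph[held].add(wanted)
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    state = defaultdict(int)

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:          # back edge: ordering cycle
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in list(graph))

# Lock order as described in this thread (task names are illustrative):
observed = {
    "rename_whiteout": ["AGF", "AGI"],  # holds AGF, then wants AGI
    "create":          ["AGI", "AGF"],  # holds AGI, then wants AGF
}

# Hypothetical consistent ordering, for comparison only.
consistent = {
    "rename_whiteout": ["AGI", "AGF"],
    "create":          ["AGI", "AGF"],
}

print("observed order can deadlock:  ", has_cycle(lock_order_edges(observed)))
print("consistent order can deadlock:", has_cycle(lock_order_edges(consistent)))
```

The observed sequences give both an AGF -> AGI edge and an AGI -> AGF
edge, which is the ABBA cycle; with a single agreed order only one of
those edges exists.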

That's the problem. I'm not sure what the solution is yet - there's no
obvious or simple way around this RENAME_WHITEOUT behaviour (which
only affects overlay, fwiw). I'll have a think about it.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx