xfs_alloc_ag_vextent_near() takes minutes to complete

[resending to proper list]

Greetings XFS community,

We have an XFS filesystem mounted "sync"
(rw,sync,noatime,wsync,attr2,discard,inode64,prjquota 0 0). It is exported via
nfsd. Its size is about 50TB.

root@0000000f:/home/alex # xfs_info /export/share1
meta-data=/dev/dm-64             isize=256    agcount=49, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=13107200000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

We observe the following issue:

An nfsd thread comes into xfs_file_write_iter()=>xfs_file_buffered_aio_write(),
which copies the data into the page cache. Then xfs_file_write_iter() calls
generic_write_sync(), which ends up working for minutes in a stack like [1].
It seems to be scanning the free space btrees, calling xfs_btree_increment()
and sometimes xfs_btree_decrement().
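
For context, that part of the write path looks roughly like this (paraphrased
from the 3.18 sources, error handling and unrelated details elided):

/* fs/xfs/xfs_file.c, xfs_file_write_iter(), roughly (3.18, paraphrased) */
ret = xfs_file_buffered_aio_write(iocb, from);  /* copy data into page cache */
if (ret > 0) {
        ssize_t err;

        /* the sync/wsync mount makes the inode IS_SYNC(), so this flushes
         * and waits right here -- this is where stack [1] spends minutes */
        err = generic_write_sync(iocb->ki_filp, iocb->ki_pos - ret, ret);
        if (err < 0)
                ret = err;
}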

This whole operation is performed while the i_lock is held exclusively by
xfs_iomap_write_allocate(), which takes it before calling xfs_bmapi_write().
So the i_lock is held for minutes.
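
The allocation then runs under that lock roughly like this (again paraphrased
from 3.18, with transaction setup and error handling elided):

/* fs/xfs/xfs_iomap.c, xfs_iomap_write_allocate(), roughly (3.18, paraphrased) */
xfs_ilock(ip, XFS_ILOCK_EXCL);
...
/* the free space btree walk of stack [1] happens in here,
 * with XFS_ILOCK_EXCL still held */
error = xfs_bmapi_write(tp, ip, map_start_fsb, count_fsb, 0,
                        &first_block, 1, imap, &nimaps, &free_list);
...
error = xfs_trans_commit(tp, 0);
xfs_iunlock(ip, XFS_ILOCK_EXCL);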

Meanwhile, another nfsd thread writes to the same inode, reaches
xfs_file_buffered_aio_write(), and calls xfs_rw_ilock(XFS_IOLOCK_EXCL),
which locks both i_mutex and i_iolock. This thread then calls
xfs_file_aio_write_checks(), which wants to grab the i_lock in a stack like
[2]. So now all three locks (i_mutex, i_iolock and i_lock) are taken: i_lock
by the first thread, i_mutex and i_iolock by the second, which is blocked
waiting for i_lock.
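
For reference, the helper the second thread goes through looks roughly like
this in 3.18 (paraphrased):

/* fs/xfs/xfs_file.c, roughly (3.18, paraphrased) */
static inline void
xfs_rw_ilock(struct xfs_inode *ip, int type)
{
        if (type & XFS_IOLOCK_EXCL)
                mutex_lock(&VFS_I(ip)->i_mutex);  /* 1) i_mutex  */
        xfs_ilock(ip, type);                      /* 2) i_iolock */
}

/* xfs_file_aio_write_checks() then calls file_update_time(), which lands in
 * xfs_vn_update_time() and takes 3) i_lock (XFS_ILOCK_EXCL) -- the lock the
 * allocating thread in [1] is holding. */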

Now any other nfsd thread calling into xfs_file_buffered_aio_write() gets
stuck waiting for i_mutex, and any other thread that needs one of these locks
gets stuck as well. In some cases this triggers a hung-task panic.

After the allocation completes, I look at the inode in question via xfs_db,
and I see that this file is not heavily fragmented; it has only a few tens of
extents. So file fragmentation is not the issue.
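
(The check was along these lines; the inode number below is just a
placeholder:

root@0000000f:/home/alex # xfs_db -r /dev/dm-64
xfs_db> inode <inode-number>
xfs_db> bmap

and the bmap output for the data fork showed only a few tens of extents.)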

Questions:
1) The thread in [1] does not seem to be waiting for IO. It wants to lock a
metadata buffer, which (to my understanding) is done before issuing any IO on
it. What is it waiting for?

2) Any recommendations on what to do in such a situation? Would running
xfs_fsr help (even though the files in question are not heavily fragmented)?

Kernel version is 3.18.19.

BTW, in the code I see the following comment:
/*
 * Locking primitives for read and write IO paths to ensure we consistently use
 * and order the inode->i_mutex, ip->i_lock and ip->i_iolock.
 */

But the above analysis clearly shows that this locking order is violated:
xfs_file_buffered_aio_write() grabs the i_mutex and the i_iolock, and only
then does it (indirectly) grab the i_lock. Is the comment wrong, or is the
code wrong? I looked at kernel 4.10, and it seems to have the same
discrepancy between the comment and the code.

Thanks,
Alex.




[1]
[<ffffffff810b90f1>] down+0x41/0x50
[<ffffffffc0d79090>] xfs_buf_lock+0x40/0x120 [xfs]
[<ffffffffc0d792cd>] _xfs_buf_find+0x15d/0x2e0 [xfs]
[<ffffffffc0d7947a>] xfs_buf_get_map+0x2a/0x210 [xfs]
[<ffffffffc0d7a1dc>] xfs_buf_read_map+0x2c/0x190 [xfs]
[<ffffffffc0dac621>] xfs_trans_read_buf_map+0x261/0x490 [xfs]
[<ffffffffc0d50a89>] xfs_btree_read_buf_block.constprop.28+0x69/0xa0 [xfs]
[<ffffffffc0d5340e>] xfs_btree_increment+0x1ee/0x2e0 [xfs]
[<ffffffffc0d384ea>] xfs_alloc_ag_vextent_near+0x58a/0x1e20 [xfs]
[<ffffffffc0d39ea5>] xfs_alloc_ag_vextent+0xd5/0x120 [xfs]
[<ffffffffc0d3b65f>] xfs_alloc_vextent+0x48f/0x690 [xfs]
[<ffffffffc0d4c4ff>] xfs_bmap_btalloc+0x39f/0x700 [xfs]
[<ffffffffc0d4c884>] xfs_bmap_alloc+0x24/0x40 [xfs]
[<ffffffffc0d4d601>] xfs_bmapi_write+0x811/0xe00 [xfs]
[<ffffffffc0d8b0c5>] xfs_iomap_write_allocate+0x155/0x4f0 [xfs]
[<ffffffffc0d7159e>] xfs_map_blocks+0x1de/0x250 [xfs]
[<ffffffffc0d72e15>] xfs_vm_writepage+0x1a5/0x630 [xfs]
[<ffffffff8117c377>] __writepage+0x17/0x50
[<ffffffff8117ccb8>] write_cache_pages+0x238/0x4c0
[<ffffffff8117cf80>] generic_writepages+0x40/0x60
[<ffffffffc0d71c93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8117e7ae>] do_writepages+0x1e/0x40
[<ffffffff81173429>] __filemap_fdatawrite_range+0x59/0x60
[<ffffffff8117352a>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffc0d80af7>] xfs_file_fsync+0x57/0x220 [xfs]
[<ffffffff8121823b>] vfs_fsync_range+0x1b/0x30
[<ffffffffc0d8238b>] xfs_file_write_iter+0xeb/0x130 [xfs]
[<ffffffff811e7545>] do_iter_readv_writev+0x65/0xa0
[<ffffffff811e8bf9>] do_readv_writev+0xc9/0x280
[<ffffffff811e8e39>] vfs_writev+0x39/0x50
[<ffffffffc0306fb7>] nfsd_vfs_write.isra.14+0x97/0x350 [nfsd]
[<ffffffffc030a8ce>] nfsd_write+0x8e/0x100 [nfsd]
[<ffffffffc0315411>] nfsd4_write+0x1c1/0x220 [nfsd]
[<ffffffffc0316edc>] nfsd4_proc_compound+0x50c/0x850 [nfsd]
[<ffffffffc0302ed3>] nfsd_dispatch+0xd3/0x240 [nfsd]
[<ffffffffc0263dc0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<ffffffffc0264157>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffc0302837>] nfsd+0x127/0x1a0 [nfsd]
[<ffffffff810911b9>] kthread+0xc9/0xe0
[<ffffffff81717918>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

[2]
[<ffffffffc0c80db1>] xfs_ilock+0x321/0x330 [xfs]
[<ffffffffc0c7e5dd>] xfs_vn_update_time+0x6d/0x1b0 [xfs]
[<ffffffff81202c15>] update_time+0x25/0xc0
[<ffffffff810d7c76>] ? current_fs_time+0x16/0x60
[<ffffffff81202e70>] file_update_time+0x80/0xd0
[<ffffffffc0c74e60>] xfs_file_aio_write_checks+0x140/0x1a0 [xfs]
[<ffffffffc0c74fac>] xfs_file_buffered_aio_write.isra.11+0xec/0x390 [xfs]
[<ffffffff811e7430>] ? new_sync_read+0xb0/0xb0
[<ffffffffc0c752d3>] xfs_file_write_iter+0x83/0x130 [xfs]
[<ffffffff811e7545>] do_iter_readv_writev+0x65/0xa0
[<ffffffff811e8bf9>] do_readv_writev+0xc9/0x280
[<ffffffffc0c75250>] ? xfs_file_buffered_aio_write.isra.11+0x390/0x390 [xfs]
[<ffffffffc0c75250>] ? xfs_file_buffered_aio_write.isra.11+0x390/0x390 [xfs]
[<ffffffff81341e43>] ? apparmor_cred_prepare+0x33/0x50
[<ffffffff811e8e39>] vfs_writev+0x39/0x50
[<ffffffffc0306fb7>] nfsd_vfs_write.isra.14+0x97/0x350 [nfsd]
[<ffffffffc030a8ce>] nfsd_write+0x8e/0x100 [nfsd]
[<ffffffffc0315411>] nfsd4_write+0x1c1/0x220 [nfsd]
[<ffffffffc0316edc>] nfsd4_proc_compound+0x50c/0x850 [nfsd]
[<ffffffffc0302ed3>] nfsd_dispatch+0xd3/0x240 [nfsd]
