I have a problem with recurring hangs on these machines: I get hung task warnings for kswapd and for many user threads stuck in lstat. While a hang is in progress, I/O grinds to a near halt, although not quite to zero. These hangs happen 1-2 times per day on a fleet of several thousand machines. We'd like to understand the root cause, if it is already known, so we can figure out the minimum kernel to which we might want to update. These are the hung task stacks:

kswapd0         D    0   279      2 0x00000000
Call Trace:
 __schedule+0x232/0x700
 schedule+0x36/0x80
 schedule_preempt_disabled+0xe/0x10
 __mutex_lock_slowpath+0x193/0x290
 mutex_lock+0x2f/0x40
 xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
 ? enqueue_entity+0x118/0x640
 ? check_preempt_curr+0x54/0x90
 ? ttwu_do_wakeup+0x19/0xe0
 ? ttwu_do_activate+0x6f/0x80
 ? try_to_wake_up+0x59/0x3e0
 ? wake_up_process+0x15/0x20
 xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 xfs_fs_free_cached_objects+0x19/0x20 [xfs]
 super_cache_scan+0x191/0x1a0
 shrink_slab.part.40+0x1fa/0x430
 shrink_slab+0x29/0x30
 shrink_node+0x108/0x320
 kswapd+0x34b/0x720
 kthread+0x109/0x140
 ? mem_cgroup_shrink_node+0x180/0x180
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x2c/0x40

kswapd1         D    0   280      2 0x00000000
Call Trace:
 __schedule+0x232/0x700
 schedule+0x36/0x80
 schedule_timeout+0x235/0x3f0
 ? blk_finish_plug+0x2c/0x40
 ? _xfs_buf_ioapply+0x334/0x460 [xfs]
 wait_for_completion+0xb4/0x140
 ? wake_up_q+0x70/0x70
 ? xfs_bwrite+0x24/0x60 [xfs]
 xfs_buf_submit_wait+0x7f/0x210 [xfs]
 xfs_bwrite+0x24/0x60 [xfs]
 xfs_reclaim_inode+0x313/0x340 [xfs]
 xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
 ? sched_clock+0x9/0x10
 ? sched_clock_cpu+0x8f/0xa0
 ? try_to_wake_up+0x59/0x3e0
 ? wake_up_process+0x15/0x20
 xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
 xfs_fs_free_cached_objects+0x19/0x20 [xfs]
 super_cache_scan+0x191/0x1a0
 shrink_slab.part.40+0x1fa/0x430
 shrink_slab+0x29/0x30
 shrink_node+0x108/0x320
 kswapd+0x34b/0x720
 kthread+0x109/0x140
 ? mem_cgroup_shrink_node+0x180/0x180
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x2c/0x40

mysqld          D    0 89015 116527 0x00000080
Call Trace:
 __schedule+0x232/0x700
 ? __remove_hrtimer+0x3c/0x70
 schedule+0x36/0x80
 rwsem_down_read_failed+0xf9/0x150
 ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
 call_rwsem_down_read_failed+0x18/0x30
 ? xfs_trans_roll+0x40/0x40 [xfs]
 down_read+0x20/0x40
 xfs_ilock+0xe5/0x110 [xfs]
 xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
 xfs_attr_get+0xd3/0x180 [xfs]
 xfs_xattr_get+0x4b/0x70 [xfs]
 __vfs_getxattr+0x57/0x70
 get_vfs_caps_from_disk+0x59/0x100
 audit_copy_inode+0x6d/0xb0
 __audit_inode+0x1bb/0x320
 filename_lookup+0x128/0x180
 ? __check_object_size+0x108/0x1e3
 ? path_get+0x27/0x30
 ? __audit_getname+0x96/0xb0
 user_path_at_empty+0x36/0x40
 vfs_fstatat+0x66/0xc0
 SYSC_newlstat+0x31/0x60
 ? syscall_trace_enter+0x1d9/0x2f0
 ? __audit_syscall_exit+0x230/0x2c0
 SyS_newlstat+0xe/0x10
 do_syscall_64+0x5b/0xc0
 entry_SYSCALL64_slow_path+0x25/0x25

All other hung threads are stuck with the same stack as the third one (mysqld). We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread Intel CPUs, 384 GiB of main memory, and four NVMe devices in an md RAID 0.
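If more data would help: when we catch a machine in this state we can also dump the stacks of every blocked task on demand instead of waiting for the next hung task warning. A minimal sketch of what we'd run (as root; sysrq-w logs the stacks of all tasks in uninterruptible sleep to the kernel log):

# echo w > /proc/sysrq-trigger
# dmesg | tail -n 200

We're happy to collect a wider sample of stacks this way on the next occurrence.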
The filesystem info is:

# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=1562366976, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The tasks above were reported as hung at 22:51:55, so given the 120-second hung task timeout they'd been blocked since 22:49:55 at the latest. The stats from around that time seem pretty normal.

# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -m

22:48:02  memtotal memfree buffers cached dirty slabmem swptotal swpfree _mem_
22:48:07 385598M 2199M 115M 23773M 14M 2435M 0M 0M
22:48:12 385598M 2151M 115M 23777M 15M 2435M 0M 0M
22:48:17 385598M 2222M 115M 23781M 16M 2435M 0M 0M
22:48:22 385598M 2213M 115M 23785M 5M 2435M 0M 0M
22:48:27 385598M 2198M 115M 23790M 11M 2435M 0M 0M
22:48:32 385598M 2209M 115M 23791M 9M 2435M 0M 0M
22:48:37 385598M 2205M 115M 23794M 0M 2435M 0M 0M
22:48:42 385598M 2199M 115M 23798M 2M 2435M 0M 0M
22:48:47 385598M 2119M 115M 23799M 1M 2435M 0M 0M
22:48:52 385598M 2106M 115M 23804M 4M 2450M 0M 0M
22:48:57 385598M 2095M 115M 23807M 4M 2450M 0M 0M
22:49:02 385598M 1997M 116M 23814M 6M 2450M 0M 0M
22:49:07 385598M 1966M 116M 23839M 4M 2451M 0M 0M
22:49:12 385598M 1989M 116M 23856M 22M 2451M 0M 0M
22:49:17 385598M 1797M 117M 23863M 29M 2452M 0M 0M
22:49:22 385598M 1920M 118M 23902M 82M 2451M 0M 0M
22:49:27 385598M 1875M 118M 23905M 83M 2451M 0M 0M
22:49:32 385598M 1794M 120M 23925M 94M 2458M 0M 0M
22:49:37 385598M 1707M 120M 23953M 35M 2459M 0M 0M
22:49:42 385598M 1618M 120M 23985M 66M 2460M 0M 0M
22:49:47 385598M 1535M 120M 24016M 95M 2460M 0M 0M
22:49:52 385598M 1499M 121M 24036M 15M 2460M 0M 0M
22:49:57 385598M 1440M 121M 24070M 48M 2461M 0M 0M
22:50:02 385598M 1368M 121M 24095M 73M 2462M 0M 0M
22:50:07 385598M 1355M 121M 24096M 73M 2461M 0M 0M
22:50:12 385598M 1368M 121M 24101M 77M 2462M 0M 0M
22:50:17 385598M 1362M 121M 24102M 78M 2461M 0M 0M
22:50:22 385598M 1339M 121M 24106M 1M 2461M 0M 0M
22:50:27 385598M 1329M 121M 24108M 2M 2461M 0M 0M
22:50:32 385598M 1354M 121M 24109M 2M 2461M 0M 0M
22:50:37 385598M 1337M 121M 24113M 6M 2461M 0M 0M
22:50:42 385598M 1316M 121M 24114M 6M 2461M 0M 0M
22:50:47 385598M 1309M 121M 24119M 11M 2461M 0M 0M
22:50:52 385598M 1310M 121M 24119M 2M 2461M 0M 0M
22:50:57 385598M 1305M 121M 24123M 6M 2461M 0M 0M
22:51:02 385598M 1289M 121M 24123M 7M 2461M 0M 0M
22:51:07 385598M 1286M 121M 24127M 8M 2462M 0M 0M
22:51:12 385598M 1312M 121M 24128M 8M 2462M 0M 0M
22:51:17 385598M 1291M 121M 24128M 9M 2462M 0M 0M
22:51:22 385598M 1270M 121M 24132M 13M 2462M 0M 0M
22:51:27 385598M 1283M 121M 24112M 1M 2462M 0M 0M
22:51:32 385598M 1279M 121M 24116M 5M 2462M 0M 0M
22:51:37 385598M 1273M 121M 24116M 6M 2462M 0M 0M
22:51:42 385598M 1275M 121M 24115M 4M 2462M 0M 0M
22:51:47 385598M 1302M 121M 24081M 6M 2462M 0M 0M
22:51:52 385598M 1289M 121M 24085M 10M 2462M 0M 0M
22:51:57 385598M 1285M 121M 24086M 3M 2462M 0M 0M
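If the cached-inode side is relevant here, we can also start sampling the XFS slab caches around these windows on a few machines. A rough sketch of what we'd collect (as root; assuming the usual xfs_inode / xfs_ili cache names in /proc/slabinfo):

# grep '^xfs' /proc/slabinfo
# slabtop -o | head -25

slabmem above stays essentially flat at ~2.4 GiB through the stall, so nothing obviously balloons, but the per-cache counts would show how much kswapd has to churn through in xfs_reclaim_inodes_ag.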
Something is obviously wrong with atop's accounting for md0 below (busy and avserv are reported as zero throughout), but you get the idea: I/O slides over a minute from a reasonable rate to near zero, except for one big spike of writes around 22:50:02.

# atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f

22:48:02  disk busy read/s KB/read writ/s KB/writ avque avserv _mdd_
22:48:07 md0 0% 1208.4 6.9 2569.0 6.6 0.0 0.00 ms
22:48:12 md0 0% 1294.1 7.0 2642.5 6.6 0.0 0.00 ms
22:48:17 md0 0% 1289.0 6.9 2823.5 6.8 0.0 0.00 ms
22:48:22 md0 0% 1376.3 6.9 2662.6 6.7 0.0 0.00 ms
22:48:27 md0 0% 1332.3 6.8 2578.7 6.3 0.0 0.00 ms
22:48:32 md0 0% 1338.2 6.9 2601.7 6.5 0.0 0.00 ms
22:48:37 md0 0% 1133.7 6.8 3172.3 6.9 0.0 0.00 ms
22:48:42 md0 0% 1377.4 6.8 2609.3 6.8 0.0 0.00 ms
22:48:47 md0 0% 1353.4 6.9 2293.6 6.5 0.0 0.00 ms
22:48:52 md0 0% 1291.8 6.9 2200.5 6.3 0.0 0.00 ms
22:48:57 md0 0% 1332.2 7.0 3066.1 6.9 0.0 0.00 ms
22:49:02 md0 0% 1275.6 6.9 3021.2 6.8 0.0 0.00 ms
22:49:07 md0 0% 1306.6 6.4 4407.2 10.4 0.0 0.00 ms
22:49:12 md0 0% 1172.4 6.6 2740.4 7.1 0.0 0.00 ms
22:49:17 md0 0% 468.4 4.9 2769.7 6.6 0.0 0.00 ms
22:49:22 md0 0% 422.3 4.9 2950.9 7.0 0.0 0.00 ms
22:49:27 md0 0% 364.7 4.9 2744.3 7.0 0.0 0.00 ms
22:49:32 md0 0% 254.6 4.7 6271.1 8.8 0.0 0.00 ms
22:49:37 md0 0% 289.6 4.6 2108.7 6.8 0.0 0.00 ms
22:49:42 md0 0% 276.1 4.7 2278.3 6.7 0.0 0.00 ms
22:49:47 md0 0% 266.2 4.7 1990.3 7.0 0.0 0.00 ms
22:49:52 md0 0% 274.9 4.7 2051.3 7.0 0.0 0.00 ms
22:49:57 md0 0% 222.7 4.6 1770.4 7.1 0.0 0.00 ms
22:50:02 md0 0% 104.1 4.5 22339.5 9.5 0.0 0.00 ms
22:50:07 md0 0% 132.9 4.4 6208.9 9.1 0.0 0.00 ms
22:50:12 md0 0% 147.2 4.5 755.7 5.1 0.0 0.00 ms
22:50:17 md0 0% 127.0 4.4 722.6 5.3 0.0 0.00 ms
22:50:22 md0 0% 120.5 4.4 703.7 4.9 0.0 0.00 ms
22:50:27 md0 0% 106.5 4.5 681.7 5.0 0.0 0.00 ms
22:50:32 md0 0% 121.1 4.5 756.5 4.7 0.0 0.00 ms
22:50:37 md0 0% 131.9 4.4 696.7 4.8 0.0 0.00 ms
22:50:42 md0 0% 100.9 4.4 669.6 4.7 0.0 0.00 ms
22:50:47 md0 0% 72.2 4.4 691.1 6.7 0.0 0.00 ms
22:50:52 md0 0% 0.0 0.0 320.4 7.6 0.0 0.00 ms
22:50:57 md0 0% 0.2 4.0 115.1 6.5 0.0 0.00 ms
22:51:02 md0 0% 0.0 0.0 62.5 5.7 0.0 0.00 ms
22:51:07 md0 0% 0.2 4.0 23.0 4.2 0.0 0.00 ms
22:51:12 md0 0% 0.2 4.0 22.6 4.1 0.0 0.00 ms
22:51:17 md0 0% 0.0 0.0 22.4 4.1 0.0 0.00 ms
22:51:22 md0 0% 0.2 4.0 26.0 4.4 0.0 0.00 ms
22:51:27 md0 0% 0.0 0.0 25.4 5.6 0.0 0.00 ms
22:51:32 md0 0% 0.0 0.0 22.6 4.1 0.0 0.00 ms
22:51:37 md0 0% 0.0 0.0 21.5 4.1 0.0 0.00 ms
22:51:42 md0 0% 0.0 0.0 26.0 5.4 0.0 0.00 ms
22:51:47 md0 0% 0.0 0.0 22.2 4.1 0.0 0.00 ms
22:51:52 md0 0% 0.0 0.0 23.4 4.1 0.0 0.00 ms
22:51:57 md0 0% 0.2 4.0 26.0 5.6 0.0 0.00 ms

Any help would be appreciated.

-jwb