On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> I have a problem of recurring hangs on machines where I get hung task
> warnings for kswapd and many user threads stuck in lstat. At the time,
> I/O grinds to a halt, although not quite to zero. These hangs happen
> 1-2 times per day on a fleet of several thousand machines. We'd like
> to understand the root cause, if it is already known, so we can figure
> out the minimum kernel to which we might want to update.
>
> These are the hung task stacks:
>
> kswapd0 D 0 279 2 0x00000000
> Call Trace:
> __schedule+0x232/0x700
> schedule+0x36/0x80
> schedule_preempt_disabled+0xe/0x10
> __mutex_lock_slowpath+0x193/0x290
> mutex_lock+0x2f/0x40
> xfs_reclaim_inodes_ag+0x288/0x330 [xfs]

You're basically running the machine out of memory. There are so many
direct reclaimers that all the inode reclaim parallelism in the
filesystem has been exhausted, and it's blocked waiting for other
reclaim to complete.
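
If you want to confirm that's what is happening, watching the direct
reclaim counters while one of these events builds up should show them
ramping. A rough sketch - the exact counter names vary between kernel
versions, so adjust the grep for whatever your 4.10 kernel actually
exports in /proc/vmstat:

# while sleep 5; do date; grep -E 'allocstall|pgscan_direct|pgsteal_direct' /proc/vmstat; done

If those counters are climbing rapidly right when kswapd gets stuck,
that lines up with a direct reclaim pile-up.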

> kswapd1 D 0 280 2 0x00000000
> Call Trace:
> __schedule+0x232/0x700
> schedule+0x36/0x80
> schedule_timeout+0x235/0x3f0
> ? blk_finish_plug+0x2c/0x40
> ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> wait_for_completion+0xb4/0x140
> ? wake_up_q+0x70/0x70
> ? xfs_bwrite+0x24/0x60 [xfs]
> xfs_buf_submit_wait+0x7f/0x210 [xfs]
> xfs_bwrite+0x24/0x60 [xfs]
> xfs_reclaim_inode+0x313/0x340 [xfs]
> xfs_reclaim_inodes_ag+0x208/0x330 [xfs]

Yup, memory reclaim is pushing so hard it is doing direct writeback of
dirty inodes.

> mysqld D 0 89015 116527 0x00000080
> Call Trace:
> __schedule+0x232/0x700
> ? __remove_hrtimer+0x3c/0x70
> schedule+0x36/0x80
> rwsem_down_read_failed+0xf9/0x150
> ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> call_rwsem_down_read_failed+0x18/0x30
> ? xfs_trans_roll+0x40/0x40 [xfs]
> down_read+0x20/0x40
> xfs_ilock+0xe5/0x110 [xfs]
> xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> xfs_attr_get+0xd3/0x180 [xfs]
> xfs_xattr_get+0x4b/0x70 [xfs]
> __vfs_getxattr+0x57/0x70
> get_vfs_caps_from_disk+0x59/0x100
> audit_copy_inode+0x6d/0xb0
> __audit_inode+0x1bb/0x320
> filename_lookup+0x128/0x180

Somebody else has the inode locked doing something with it, so the
audit code has to wait for it to finish before it can continue. This
likely has nothing to do with memory reclaim; more likely it's backed
up behind a transaction or something doing IO....

> ? __check_object_size+0x108/0x1e3
> ? path_get+0x27/0x30
> ? __audit_getname+0x96/0xb0
> user_path_at_empty+0x36/0x40
> vfs_fstatat+0x66/0xc0
> SYSC_newlstat+0x31/0x60
> ? syscall_trace_enter+0x1d9/0x2f0
> ? __audit_syscall_exit+0x230/0x2c0
> SyS_newlstat+0xe/0x10
> do_syscall_64+0x5b/0xc0
> entry_SYSCALL64_slow_path+0x25/0x25
>
> All other hung threads are stuck in the third stack.
>
> We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> RAID 0. The filesystem info is:
>
> # xfs_info /dev/md0
> meta-data=/dev/md0             isize=256    agcount=6, agsize=268435455 blks
>          =                     sectsz=512   attr=2, projid32bit=0
>          =                     crc=0        finobt=0 spinodes=0
> data     =                     bsize=4096   blocks=1562366976, imaxpct=5
>          =                     sunit=0      swidth=0 blks
> naming   =version 2            bsize=4096   ascii-ci=0 ftype=1
> log      =internal             bsize=4096   blocks=521728, version=2
>          =                     sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                 extsz=4096   blocks=0, rtextents=0
>
> The tasks above were reported as hung at 22:51:55, so they'd been dead
> since 22:49:55 at the latest. The stats from around that time seem
> pretty normal.

So you've got processes stuck waiting a couple of minutes for IO on
nvme drives? That doesn't sound like a filesystem problem - that sounds
more like lost IOs, a hung MD array, or hardware that's gone AWOL....

> 22:48:02  memtotal  memfree  buffers  cached  dirty  slabmem  swptotal  swpfree  _mem_
> 22:48:07  385598M  2199M  115M  23773M  14M  2435M  0M  0M

Nothing unusual there - still lots of reclaimable page cache, and some
slabmem, too.

> Something is obviously wrong with atop's analysis of md0 I/O rate
> here, but you get the idea: I/O slides over a minute from a reasonable
> rate to near zero, except for one big spike of output.
>
> # atopsar -r /var/log/atop/atop_20190517 -b 22:48:00 -e 22:53:00 -f
>
> 22:48:02  disk  busy  read/s  KB/read  writ/s  KB/writ  avque  avserv  _mdd_
> 22:48:07  md0  0%  1208.4  6.9  2569.0  6.6  0.0  0.00 ms
> 22:48:12  md0  0%  1294.1  7.0  2642.5  6.6  0.0  0.00 ms
> 22:48:17  md0  0%  1289.0  6.9  2823.5  6.8  0.0  0.00 ms
> 22:48:22  md0  0%  1376.3  6.9  2662.6  6.7  0.0  0.00 ms
> 22:48:27  md0  0%  1332.3  6.8  2578.7  6.3  0.0  0.00 ms
> 22:48:32  md0  0%  1338.2  6.9  2601.7  6.5  0.0  0.00 ms
> 22:48:37  md0  0%  1133.7  6.8  3172.3  6.9  0.0  0.00 ms
> 22:48:42  md0  0%  1377.4  6.8  2609.3  6.8  0.0  0.00 ms
> 22:48:47  md0  0%  1353.4  6.9  2293.6  6.5  0.0  0.00 ms
> 22:48:52  md0  0%  1291.8  6.9  2200.5  6.3  0.0  0.00 ms
> 22:48:57  md0  0%  1332.2  7.0  3066.1  6.9  0.0  0.00 ms
> 22:49:02  md0  0%  1275.6  6.9  3021.2  6.8  0.0  0.00 ms
> 22:49:07  md0  0%  1306.6  6.4  4407.2  10.4  0.0  0.00 ms
> 22:49:12  md0  0%  1172.4  6.6  2740.4  7.1  0.0  0.00 ms
> 22:49:17  md0  0%  468.4  4.9  2769.7  6.6  0.0  0.00 ms

IO rates take a hit here, about 40s before the processes got stuck.

> 22:49:22  md0  0%  422.3  4.9  2950.9  7.0  0.0  0.00 ms
> 22:49:27  md0  0%  364.7  4.9  2744.3  7.0  0.0  0.00 ms
> 22:49:32  md0  0%  254.6  4.7  6271.1  8.8  0.0  0.00 ms
> 22:49:37  md0  0%  289.6  4.6  2108.7  6.8  0.0  0.00 ms
> 22:49:42  md0  0%  276.1  4.7  2278.3  6.7  0.0  0.00 ms
> 22:49:47  md0  0%  266.2  4.7  1990.3  7.0  0.0  0.00 ms
> 22:49:52  md0  0%  274.9  4.7  2051.3  7.0  0.0  0.00 ms
> 22:49:57  md0  0%  222.7  4.6  1770.4  7.1  0.0  0.00 ms

Processes stick here just as the read rate takes another hit, and...

> 22:50:02  md0  0%  104.1  4.5  22339.5  9.5  0.0  0.00 ms

That's likely a burst of 8KB inode writes from the XFS inode reclaim
code.

> 22:50:07  md0  0%  132.9  4.4  6208.9  9.1  0.0  0.00 ms
> 22:50:12  md0  0%  147.2  4.5  755.7  5.1  0.0  0.00 ms
> 22:50:17  md0  0%  127.0  4.4  722.6  5.3  0.0  0.00 ms
> 22:50:22  md0  0%  120.5  4.4  703.7  4.9  0.0  0.00 ms
> 22:50:27  md0  0%  106.5  4.5  681.7  5.0  0.0  0.00 ms
> 22:50:32  md0  0%  121.1  4.5  756.5  4.7  0.0  0.00 ms
> 22:50:37  md0  0%  131.9  4.4  696.7  4.8  0.0  0.00 ms
> 22:50:42  md0  0%  100.9  4.4  669.6  4.7  0.0  0.00 ms
> 22:50:47  md0  0%  72.2  4.4  691.1  6.7  0.0  0.00 ms
> 22:50:52  md0  0%  0.0  0.0  320.4  7.6  0.0  0.00 ms

And a minute later everything has ground to a halt.

> 22:50:57  md0  0%  0.2  4.0  115.1  6.5  0.0  0.00 ms
> 22:51:02  md0  0%  0.0  0.0  62.5  5.7  0.0  0.00 ms
> 22:51:07  md0  0%  0.2  4.0  23.0  4.2  0.0  0.00 ms
> 22:51:12  md0  0%  0.2  4.0  22.6  4.1  0.0  0.00 ms
> 22:51:17  md0  0%  0.0  0.0  22.4  4.1  0.0  0.00 ms
> 22:51:22  md0  0%  0.2  4.0  26.0  4.4  0.0  0.00 ms
> 22:51:27  md0  0%  0.0  0.0  25.4  5.6  0.0  0.00 ms
> 22:51:32  md0  0%  0.0  0.0  22.6  4.1  0.0  0.00 ms
> 22:51:37  md0  0%  0.0  0.0  21.5  4.1  0.0  0.00 ms
> 22:51:42  md0  0%  0.0  0.0  26.0  5.4  0.0  0.00 ms
> 22:51:47  md0  0%  0.0  0.0  22.2  4.1  0.0  0.00 ms
> 22:51:52  md0  0%  0.0  0.0  23.4  4.1  0.0  0.00 ms
> 22:51:57  md0  0%  0.2  4.0  26.0  5.6  0.0  0.00 ms

This looks to me like something below the filesystem choking up and
grinding to a halt. What are all the nvme drives doing over this
period?

Hard to know what is going on at this point, though - nothing at the
filesystem or memory reclaim level should be stuck on IO for long
periods of time on nvme SSDs...

/me wonders if 4.10 had the block layer writeback throttle code in it,
and if it does, whether that is what has gone haywire here.
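
FWIW, two things would be worth capturing next time this happens.
First, the per-device numbers from the same atop logs, to see whether
one nvme drive or all of them stop responding at that point - I think
atopsar's -d (per-disk) report will give you that, but check the
version you have:

# atopsar -r /var/log/atop/atop_20190517 -d -b 22:48:00 -e 22:53:00

Second, whether that kernel has the block layer writeback throttle
built in and active on the nvme queues. Assuming the usual Ubuntu
config location and sysfs layout, something like this (repeat for each
of your nvme devices):

# grep BLK_WBT /boot/config-$(uname -r)
# cat /sys/block/nvme0n1/queue/wbt_lat_usec

If wbt_lat_usec exists and is non-zero, writing 0 to it disables the
throttling on that device, which would be a quick way to rule it in or
out.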

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx