On Sun, May 26, 2019 at 8:32 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 24, 2019 at 01:34:58PM -0700, Jeffrey Baker wrote:
> > On Tue, May 21, 2019 at 3:49 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, May 21, 2019 at 09:21:10AM -0700, Jeffrey Baker wrote:
> > > > I have a problem of recurring hangs on machines where I get hung task
> > > > warnings for kswapd and many user threads stuck in lstat. At the time,
> > > > I/O grinds to a halt, although not quite to zero. These hangs happen
> > > > 1-2 times per day on a fleet of several thousand machines. We'd like
> > > > to understand the root cause, if it is already known, so we can figure
> > > > out the minimum kernel to which we might want to update.
> > > >
> > > > These are the hung task stacks:
> > > >
> > > > kswapd0         D    0   279      2 0x00000000
> > > > Call Trace:
> > > >  __schedule+0x232/0x700
> > > >  schedule+0x36/0x80
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock_slowpath+0x193/0x290
> > > >  mutex_lock+0x2f/0x40
> > > >  xfs_reclaim_inodes_ag+0x288/0x330 [xfs]
> > >
> > > You're basically running the machine out of memory and there
> > > are so many direct reclaimers that all the inode reclaim parallelism in
> > > the filesystem has been exhausted and it's blocking waiting for
> > > other reclaim to complete.
> > Sort of. "Out of memory" here means > 2GB free, admittedly that's
> > less than 1% of the machine.
>
> By "out of memory" I meant that your machine appears to have been at
> or below the memory reclaim watermarks for an extended period of
> time, not that there is no free memory at all.

Indeed, we always have one NUMA zone at the low water mark, although we
have zone reclaim disabled.

> > > > kswapd1         D    0   280      2 0x00000000
> > > > Call Trace:
> > > >  __schedule+0x232/0x700
> > > >  schedule+0x36/0x80
> > > >  schedule_timeout+0x235/0x3f0
> > > >  ? blk_finish_plug+0x2c/0x40
> > > >  ? _xfs_buf_ioapply+0x334/0x460 [xfs]
> > > >  wait_for_completion+0xb4/0x140
> > > >  ? wake_up_q+0x70/0x70
> > > >  ? xfs_bwrite+0x24/0x60 [xfs]
> > > >  xfs_buf_submit_wait+0x7f/0x210 [xfs]
> > > >  xfs_bwrite+0x24/0x60 [xfs]
> > > >  xfs_reclaim_inode+0x313/0x340 [xfs]
> > > >  xfs_reclaim_inodes_ag+0x208/0x330 [xfs]
> > >
> > > Yup, memory reclaim is pushing so hard it is doing direct writeback
> > > of dirty inodes.
> > OK. Is that reflected in vmstat somewhere? I see
> > "nr_vmscan_immediate_reclaim" but it's not obvious to me what that is,
> > and it wasn't increasing at the moment of this event.
>
> Nope, not for low level filesystem inode reclaim like this. Getting
> the filesystem to do writeback from the inode shrinker is relatively
> uncommon.
> > > > All other hung threads are stuck in the third stack.
> > > >
> > > > We are using the Ubuntu 16.04 kernel, 4.10.0-40-generic
> > > > #44~16.04.1-Ubuntu. The machines involved have 20-core / 40-thread
> > > > Intel CPUs, 384 GiB of main memory, and four nvme devices in an md
> > > > RAID 0. The filesystem info is:
> > > >
> > > > # xfs_info /dev/md0
> > > > meta-data=/dev/md0       isize=256    agcount=6, agsize=268435455 blks
> > > >          =               sectsz=512   attr=2, projid32bit=0
> > > >          =               crc=0        finobt=0 spinodes=0
> > > > data     =               bsize=4096   blocks=1562366976, imaxpct=5
> > > >          =               sunit=0      swidth=0 blks
>
> FWIW, I just noticed you don't have a sunit/swidth set for your
> md RAID 0, which means you only have 6 AGs rather than 32. So
> there's a lot less allocation and reclaim concurrency in this
> filesystem than we'd normally see for a 4x RAID0 stripe....
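
Right - as I understand it, getting the striped layout (and the larger
AG count) would have meant handing mkfs the array geometry explicitly,
something like

  mkfs.xfs -d su=512k,sw=4 /dev/md0   # su = md chunk size (512k is a guess), sw = 4 data disks

where su is the md chunk size (512k is only an assumption on my part;
whatever mdadm reports for the array is what belongs there) and sw is
the number of data disks in the stripe.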
To some extent this has been a forensic exercise for me to figure out
how and why these filesystems were deployed like this. I think it was
believed at the time that 0 was a magic automatic value for su/sw.

>
> > > This looks to me like something below the filesystem choking up and
> > > grinding to a halt. What are all the nvme drives doing over this
> > > period?
> > Write rates on the nvme drives are all exactly the md0 rates / 4, so
> > that seems normal.
>
> Write rates aren't that important - what do the io latencies, queue
> depths and device utilisation figures look like?

10s of microseconds, ~zero, and ~zero, respectively.

> > > Hard to know what is going on at this point, though, but nothing
> > > at the filesystem or memory reclaim level should be stuck on IO for
> > > long periods of time on nvme SSDs...
> > >
> > > /me wonders if 4.10 had the block layer writeback throttle code in
> > > it, and if it does whether that is what has gone haywire here.
> > That is an interesting point. I do see this on a different, healthy box
> >
> > # cat /sys/kernel/debug/bdi/9\:0/stats
> > BdiWriteback:             4320 kB
> > BdiReclaimable:           2304 kB
> > BdiDirtyThresh:        4834552 kB
> > DirtyThresh:           4834552 kB
> > BackgroundThresh:      2414324 kB
> > BdiDirtied:         5316993504 kB
> > BdiWritten:         4732763040 kB
> > BdiWriteBandwidth:        6060 kBps
>
> That's the currently used write bandwidth of the backing device. The
> block layer writeback throttle is below this. i.e. This simply
> reflects the rate at which data is currently being written to the
> block device.
> > One hopes that no throttling is based on this estimate of 6MB/s since
> > this array is capable of more like 10000MB/s. At this time we're
> > focusing on system tunables that might prevent the machine from
> > getting into serious trouble with dirty pages or reclaim or whatever.
> > I know XFS has only a handful of tunables. Do you think it's
> > advisable to lower the xfssyncd timer from 30s to something else?
> > Our workload _is_ metadata-heavy; I see about 100 create/remove ops
> > per second
>
> FWIW, that's not what I'd call metadata heavy. XFS on a 4-SSD array with
> 20 CPU cores can sustain somewhere around 300,000 create/removes a
> second before it runs out of CPU. 100/s is barely a blip above
> idle....

Quite. That's part of the mystery: why the machine grinds to a halt
while barely using its capacity. What I meant, though, was that the
metadata rate is fairly high in my experience in proportion to the
amount of other I/O, which is also pretty low here.

> >
> > and the log stats are rapidly increasing. By contrast we're
> > not really pushing the extent allocator at all.
> >
> > extent_alloc 10755989 500829395 3030459 4148636182
> > abt 0 0 0 0
> > blk_map 3012410686 1743133993 196385287 18176676 100618849 659563291 0
> > bmbt 0 0 0 0
> > dir 112134 99359626 99358346 49393915
>
> Ok, so 100M files created and removed since boot.
> >
> > trans 98 4036033830 0
>
> All async transactions.
> > ig 99320520 98796299 5890 524221 0 523020 140279
> > log 1339041912 2184819896 0 1341870387 1341704946
>         writes     blocks   |    forces  force_waits
>                        noiclogs
>
> Ok, so why are there so many log forces (1.34 /billion/)? There's
> roughly a log force in every 3rd transaction, but they aren't from
> the transaction commit (because they are async transactions).
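
Just to sanity-check that ratio against the counters quoted above
(log forces divided by sync plus async transactions):

  echo "scale=3; 1341870387 / (98 + 4036033830)" | bc   # forces / transactions

which prints .332, i.e. almost exactly one log force for every three
transactions.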
> > push_ail 4109383649 0 37995385 2614017 0 136962 25152 3301200 0 47933
>    try_logspace |   pushes  success |  pinned  |  flushing   |  flush
>      sleep_logspace          pushbuf      locked       restarts
>
> They aren't coming from the metadata writeback code (pinned count),
> and the log is not blocking on free space. i.e. there's no metadata
> writeback bottleneck occurring.
> >
> > xstrat 589377 0
> > rw 3561425929 1342108567
>
> Oh, 1.34 billion writes. You're doing O_DSYNC writes, yes? And lots
> of them to every file that is created?
> >
> > attr 3297942617 0 0 0
> > icluster 1452111 945722 3364798
> > vnodes 1201 0 0 0 99307909 99307909 99307909 0
> > buf 779252853 2349573 776924344 978078 41302 2328509 0 2491287 1090
> > abtb2 13886237 90593055 270338 266221 17 15 4555 1481 1188 2529 30 22
> > 47 37 93926009
> > abtc2 28760802 203677378 13739372 13735284 45 43 3270 1292 1935 1691
> > 57 50 102 93 1413028741
> > bmbt2 14355745 140867292 7158285 7115081 8 2 4074 4122 25396 4074 230
> > 48 238 50 6974741
> > ibt2 397071627 776203283 67846 67820 0 0 0 0 0 0 0 0 0 0 413
>
> For 100M files created/removed, to only see ~68000 inode btree
> record inserts and removes implies that the filesystem is
> efficiently reusing the freed inodes. i.e. there's pretty much a
> steady state of inodes in use in the workload....
> >
> > fibt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > rmapbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > refcntbt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> > qm 0 0 0 0 0 0 0 0
> > xpc 17666569097216 111928912437311 80399355018480
>                          write_bytes     read_bytes
>
> Assuming this is the only substantially active XFS filesystem on the
> system, that's 101TiB written to those drives, in 1.34 billion writes,
> which is an average of 82KiB per write. Assuming that the 32 bit
> write counter has not wrapped. Given that the transaction counter is
> nearly at the wrap point, and the read counter is at 3.5B for ~75TiB
> (average 22KiB), let's assume that the write counter has wrapped
> and we have ~5.5 billion writes. That gives an average write of just
> on 20KiB which kinda matches the read....
>
> So the drives look like they are being constantly hammered with
> small, synchronous IOs. This means write amplification is likely to
> be a substantial problem for them. They've had a /lot/ of data
> written to them and are under constant load so there's no time for
> idle cleanup, so is it possible that the drives themselves are
> having internal garbage collection related slowdowns?

That's certainly not impossible, although these are Very Fancy(tm)
devices. I'll have a look at the logs.

Thanks again,
jwb

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
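
P.S. For my own notes, the average I/O sizes above reproduce directly
from the quoted xpc and rw counters; the only assumption is the single
wrap of the 32-bit write-call counter that you suggest:

  echo "111928912437311 / 1342108567" | bc                 # bytes per write, no wrap
  echo "111928912437311 / (1342108567 + 4294967296)" | bc  # bytes per write, assuming one 32-bit wrap
  echo "80399355018480 / 3561425929" | bc                  # bytes per read

which print 83397, 19855 and 22575 bytes, in line with the ~82KiB
(unwrapped), ~20KiB (wrapped once) and 22KiB figures above.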