Hello.

> > > <snip>
> > > > >
> > > > > Other directories on the same filesystem seem fine as do other XFS
> > > > > filesystems on the same system.
> > > >
> > > > The fact you mention other directories seems to work, and the first stack trace
> > > > you posted, it sounds like you've been keeping a single AG too busy to almost
> > > > make it unusable. But, you didn't provide enough information we can really make
> > > > any progress here, and to be honest I'm more inclined to point the finger to
> > > > your MD device.
> > >
> > > Let's see if we can pinpoint something :-)
> > >
> > > > Can you describe your MD device? RAID array? What kind? How many disks?
> > >
> > > RAID6 8 disks.
> > >
> > > > What's your filesystem configuration? (xfs_info <mount point>)
> > >
> > > meta-data=/dev/md126             isize=512    agcount=32, agsize=43954432 blks
> > >          =                       sectsz=4096  attr=2, projid32bit=1
> > >          =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> > >          =                       reflink=0
> > > data     =                       bsize=4096   blocks=1406538240, imaxpct=5
> > >          =                       sunit=128    swidth=768 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal               bsize=4096   blocks=521728, version=2
> > >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> >                                                 ^^^^^^ This should have been
> > configured to 8 blocks, not 1
> >
> > > Yes there's more. See a slightly elided dmesg from a longer run on
> > > https://sucs.org/~sits/test/kern-20191024.log.gz .
> >
> > At a first quick look, it looks like you are having lots of IO contention in the
> > log, and this is slowing down the rest of the filesystem. What caught my
>
> Should it become so slow that a task freezes entirely and never
> finishes? Once the problem hits it's not like anything makes any more
> progress on those directories nor was there very much generating dirty
> data.
>

I am not sure how long you waited before you assumed it 'never finishes' :), but...

I've seen systems starve to death due to the large amounts of RMW being
generated, mainly when disks are slow or problematic. Since you are using SSDs
and not spindles, I wonder whether the SSDs' write cycles might play a role
here, but that's just a question I'm asking myself; I am not sure whether it
can make things even worse.

In the file you sent above, you basically have:

An xfs-sync worker thread, which forced a log flush and is now sleeping,
waiting for the flush to complete:

Oct 24 16:27:11 <host> kernel: [115151.674164] Call Trace:
Oct 24 16:27:11 <host> kernel: [115151.674170]  __schedule+0x24e/0x880
Oct 24 16:27:11 <host> kernel: [115151.674175]  schedule+0x2c/0x80
Oct 24 16:27:11 <host> kernel: [115151.674178]  schedule_timeout+0x1cf/0x350
Oct 24 16:27:11 <host> kernel: [115151.674184]  ? sched_clock+0x9/0x10
Oct 24 16:27:11 <host> kernel: [115151.674187]  ? sched_clock+0x9/0x10
Oct 24 16:27:11 <host> kernel: [115151.674191]  ? sched_clock_cpu+0x11/0xb0
Oct 24 16:27:11 <host> kernel: [115151.674196]  wait_for_completion+0xba/0x140
Oct 24 16:27:11 <host> kernel: [115151.674199]  ? wake_up_q+0x80/0x80
Oct 24 16:27:11 <host> kernel: [115151.674203]  __flush_work+0x15b/0x210
Oct 24 16:27:11 <host> kernel: [115151.674206]  ? worker_detach_from_pool+0xa0/0xa0
Oct 24 16:27:11 <host> kernel: [115151.674210]  flush_work+0x10/0x20
Oct 24 16:27:11 <host> kernel: [115151.674250]  xlog_cil_force_lsn+0x7b/0x210 [xfs]
Oct 24 16:27:11 <host> kernel: [115151.674253]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674256]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674259]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674262]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674264]  ? __switch_to_asm+0x41/0x70
Oct 24 16:27:11 <host> kernel: [115151.674267]  ? __switch_to_asm+0x35/0x70
Oct 24 16:27:11 <host> kernel: [115151.674306]  _xfs_log_force+0x8f/0x2a0 [xfs]

which wakes up, and waits for, the xlog_cil_push work running on the following
thread:

Oct 24 16:29:12 <host> kernel: [115272.503324] kworker/52:2    D    0 56479      2 0x80000080
Oct 24 16:29:12 <host> kernel: [115272.503381] Workqueue: xfs-cil/md126 xlog_cil_push_work [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503383] Call Trace:
Oct 24 16:29:12 <host> kernel: [115272.503389]  __schedule+0x24e/0x880
Oct 24 16:29:12 <host> kernel: [115272.503394]  schedule+0x2c/0x80
Oct 24 16:29:12 <host> kernel: [115272.503444]  xlog_state_get_iclog_space+0x105/0x2d0 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503449]  ? wake_up_q+0x80/0x80
Oct 24 16:29:12 <host> kernel: [115272.503492]  xlog_write+0x163/0x6e0 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503536]  xlog_cil_push+0x2a7/0x410 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503577]  xlog_cil_push_work+0x15/0x20 [xfs]
Oct 24 16:29:12 <host> kernel: [115272.503581]  process_one_work+0x1de/0x420
Oct 24 16:29:12 <host> kernel: [115272.503584]  worker_thread+0x32/0x410
Oct 24 16:29:12 <host> kernel: [115272.503587]  kthread+0x121/0x140
Oct 24 16:29:12 <host> kernel: [115272.503590]  ? process_one_work+0x420/0x420
Oct 24 16:29:12 <host> kernel: [115272.503593]  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 24 16:29:12 <host> kernel: [115272.503596]  ret_from_fork+0x35/0x40

xlog_state_get_iclog_space() is waiting at:

        xlog_wait(&log->l_flush_wait, &log->l_icloglock);

which should be awakened by xlog_state_do_callback(), which is called from
xlog_state_done_syncing(), which, in short, IIRC, comes from the bio end_io
callbacks. I'm saying this from a quick look at the current code, though.
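To make that sleep/wakeup dependency a bit more concrete, here is a tiny
userspace sketch of the same pattern. It is plain pthreads, nothing to do with
the actual kernel waitqueue/spinlock primitives, and all the names are only
illustrative: one thread blocks the way xlog_state_get_iclog_space() blocks on
l_flush_wait, and only a simulated IO-completion callback can wake it. If that
completion never arrives, the waiter sleeps forever, which is roughly the
picture the stuck kworker above gives:

/*
 * Userspace sketch only: a pthread mutex/condvar pair standing in for the
 * kernel's icloglock/l_flush_wait.  Names are illustrative, not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t icloglock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_wait = PTHREAD_COND_INITIALIZER;
static bool iclog_space_available;

/* Stands in for xlog_state_get_iclog_space() sleeping in xlog_wait(). */
static void *cil_push_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&icloglock);
        while (!iclog_space_available)          /* no free iclog yet */
                pthread_cond_wait(&flush_wait, &icloglock);
        pthread_mutex_unlock(&icloglock);
        printf("cil push: got iclog space, can write to the log now\n");
        return NULL;
}

/* Stands in for the bio end_io -> xlog_state_done_syncing() ->
 * xlog_state_do_callback() path that wakes l_flush_wait. */
static void log_io_completion(void)
{
        pthread_mutex_lock(&icloglock);
        iclog_space_available = true;
        pthread_cond_broadcast(&flush_wait);    /* the "wake_up_all()" */
        pthread_mutex_unlock(&icloglock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, cil_push_thread, NULL);
        sleep(1);               /* pretend the log buffer IO takes a while */
        log_io_completion();    /* if this never runs, the waiter hangs forever */
        pthread_join(t, NULL);
        return 0;
}

In other words, everything downstream is pinned on that one IO completion
actually arriving.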
So, I think it ends up with your system waiting for journal IO completion in
order to make progress (or maybe your storage stack missed an IO and XFS is
stuck waiting for a completion that will never come; maybe it's deadlocked
somewhere I am not seeing).

I may very well be wrong though; it's been a while since I last worked on this
part of the code, so I'll wait for somebody else to give a different POV rather
than risk pointing you in the wrong direction.

I'll try to keep looking at it when I get some extra time, but in the end you
will probably need to update your kernel to a recent version and try to
reproduce the problem again; chasing bugs on old kernels is not really in the
scope of this list. Still, I'll give it some more thought when I get some extra
time (if nobody sees what's wrong before :)

> If this were to happen again though what extra information would be
> helpful (I'm guessing things like /proc/meminfo output)?
>
> > attention at first was the wrongly configured log striping for the filesystem and
> > I wonder if this isn't responsible for the amount of IO contention you are
> > having in the log. This might well be generating lots of RMW cycles while
> > writing to the log, generating the IO contention and slowing down the rest of the
> > filesystem. I'll try to take a more careful look later on.
>
> My understanding is that the md "chunk size" is 64k so basically
> you're saying the sectsz should have been manually set to be as big as
> possible at mkfs time? I never realised this never happened by default
> (I see the sunit seems to be correct given the block size of 4096 but
> I'm unsure about swidth)...

Your RAID chunk is 512k according to the information you provided previously:

md126 : active raid6 sdj[6] sdg[3] sdi[2] sdh[7] sdc[4] sdd[0] sda[5] sdb[1]
      5626152960 blocks level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/7 pages [0KB], 65536KB chunk

And I believe you used default mkfs options when you created the filesystem, so
xfs properly identified the geometry and automatically configured it for you:

        sunit=128 swidth=768 blks
        (128 * 4096) = 512KiB

And the swidth matches exactly the number of data disks I'd expect from your
array, i.e. 6 data disks (+ 2 parity):

        (768 / 128) = 6

What I meant is that the log sunit is weird in your configuration; see the
'sunit' in the log section above. The XFS log stripe unit can range from 32KiB
to 256KiB, and mkfs configures it to match the data sunit UNLESS the data sunit
is bigger than 256KiB, in which case mkfs sets it to 32KiB by default. So in
your case, using the default mkfs configuration, I was expecting to see a log
sunit of 8 blocks:

        (8 * 4096) = 32KiB
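If you want to redo that arithmetic yourself, here is a trivial sketch; the
constants are just copied from your xfs_info output above, so it only restates
the numbers, nothing authoritative:

/* Redo the geometry arithmetic from the xfs_info output above. */
#include <stdio.h>

int main(void)
{
        const unsigned int blocksize = 4096;    /* bsize */
        const unsigned int sunit     = 128;     /* data sunit, in blocks */
        const unsigned int swidth    = 768;     /* data swidth, in blocks */

        printf("data stripe unit  : %u KiB\n", sunit * blocksize / 1024);  /* 512 */
        printf("data disks        : %u\n", swidth / sunit);                /* 6 */
        printf("expected log sunit: %u blocks (%u KiB)\n",
               8u, 8 * blocksize / 1024);                                  /* 32 */
        return 0;
}

And if you ever recreate the filesystem, I believe mkfs.xfs lets you set the
log stripe unit explicitly with something like '-l su=32k', but double-check
the man page of your xfsprogs version.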
Maybe your xfsprogs version had a bug, or maybe somebody set it manually; I
can't really say.

--
Carlos