On Tue, 2022-01-11 at 10:37 +1100, Dave Chinner wrote:
> On Thu, Jan 06, 2022 at 06:36:52PM +0000, Trond Myklebust wrote:
> > On Thu, 2022-01-06 at 09:48 +1100, Dave Chinner wrote:
> > > On Wed, Jan 05, 2022 at 08:45:05PM +0000, Trond Myklebust wrote:
> > > > On Tue, 2022-01-04 at 21:09 -0500, Trond Myklebust wrote:
> > > > > On Tue, 2022-01-04 at 12:22 +1100, Dave Chinner wrote:
> > > > > > On Tue, Jan 04, 2022 at 12:04:23AM +0000, Trond Myklebust wrote:
> > > > > > > We have different reproducers. The common feature appears to be
> > > > > > > the need for a decently fast box with fairly large memory (128GB
> > > > > > > in one case, 400GB in the other). It has been reproduced with
> > > > > > > HDs, SSDs and NVMe systems.
> > > > > > > 
> > > > > > > On the 128GB box, we had it set up with 10+ disks in a JBOD
> > > > > > > configuration and were running the AJA system tests.
> > > > > > > 
> > > > > > > On the 400GB box, we were just serially creating large (> 6GB)
> > > > > > > files using fio and that was occasionally triggering the issue.
> > > > > > > However, doing an strace of that workload to disk reproduced the
> > > > > > > problem faster :-).
> > > > > > 
> > > > > > Ok, that matches up with the "lots of logically sequential dirty
> > > > > > data on a single inode in cache" vector that is required to create
> > > > > > really long bio chains on individual ioends.
> > > > > > 
> > > > > > Can you try the patch below and see if it addresses the issue?
> > > > > 
> > > > > That patch does seem to fix the soft lockups.
> > > > 
> > > > Oops... Strike that; apparently our tests just hit the following when
> > > > running on AWS with that patch.
> > > 
> > > OK, so there are also large contiguous physical extents being
> > > allocated in some cases here.
> > > 
> > > > So it was harder to hit, but we still did eventually.
> > > 
> > > Yup, that's what I wanted to know - it indicates that both the
> > > filesystem completion processing and the iomap page processing play
> > > a role in the CPU usage. More complex patch for you to try below...
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > 
> > Hi Dave,
> > 
> > This patch got further than the previous one. However, it too failed on
> > the same AWS setup after we started creating larger (in this case 52GB)
> > files. The previous patch failed at 15GB.
> > 
> > NR_06-18:00:17 pm-46088DSX1 /mnt/data-portal/data $ ls -lh
> > total 59G
> > -rw-r----- 1 root root 52G Jan 6 18:20 100g
> > -rw-r----- 1 root root 9.8G Jan 6 17:38 10g
> > -rw-r----- 1 root root 29 Jan 6 17:36 file
> > NR_06-18:20:10 pm-46088DSX1 /mnt/data-portal/data $
> > Message from syslogd@pm-46088DSX1 at Jan 6 18:22:44 ...
> > kernel:[ 5548.082987] watchdog: BUG: soft lockup - CPU#10 stuck for
> > 24s! [kworker/10:0:18995]
> 
> Ok, so coming back to this set of failures. Firstly, the patch I
> sent you has a bug in it, meaning it did not merge ioends across
> independent ->writepages invocations. Essentially it would merge
> ioends as long as the ioends in the chain are all the same size (e.g.
> 4096 pages long). The moment an ioend of a different size is added
> to the chain (i.e. the runt at the tail of the writepages
> invocation), the merging stopped.
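
To make that failure mode concrete for anyone following along: the merge
decision described above boils down to a physical-contiguity check between
adjacent ioends. The sketch below uses invented names and is not the patch
from this thread; the point (expanded on in the quoted text that follows) is
that the comparison has to use a start sector recorded at submission time,
because bio->bi_iter.bi_sector no longer points at the start of the I/O by
the time completion runs.

/* Illustrative sketch only -- invented names, not the patch from this thread. */
#include <linux/list.h>
#include <linux/types.h>

struct demo_ioend {
	struct list_head	io_list;		/* chain of completed ioends */
	sector_t		io_start_sector;	/* start sector saved at submission */
	sector_t		io_sectors;		/* I/O length in 512-byte sectors */
};

/*
 * Two ioends can be merged for completion processing only if the second
 * starts exactly where the first ends on disk.  If the start sector used
 * here were read back from bio->bi_iter.bi_sector at completion time, the
 * comparison would be made against a value the block layer has already
 * advanced during submission, which is what broke merging in the earlier
 * patch.
 */
static bool demo_ioend_can_merge(const struct demo_ioend *prev,
				 const struct demo_ioend *next)
{
	return prev->io_start_sector + prev->io_sectors == next->io_start_sector;
}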
> 
> That means the merging was limited to whatever writeback bandwidth
> chunking broke the per-file writeback into. I'm generally seeing
> that to be 100-200MB chunks per background writeback invocation,
> and so nothing is merging beyond that size.
> 
> Once I fixed that bug (caused by bio->bi_iter.bi_sector being
> modified by the block layer stack during submission when bios are
> split, so it doesn't point at the start sector at IO completion),
> the only way I could get merging beyond submission chunking was
> to induce a long scheduling delay in the XFS completion processing,
> e.g. by adding msleep(1000) to xfs_end_io() before any of the
> merging occurred.
> 
> In this case, I can't get any extra merging to occur - the
> scheduling latency on IO completion is just so low and the
> completion processing so fast that little to no merging occurs
> at all with ioends split into 4096 page chunks.
> 
> So, let's introduce scheduling delays. The only one that matters
> here is a delay running the XFS end IO work task - it will pull all
> the pending ioends and process them as one loop. Hence the only way
> we can get large merged ioends is to delay the processing before we
> pull the completed ioends off the inode.
> 
> Worst case is that a scheduling delay will allow a single file to
> dirty enough pages to hit the throttling limits in
> balance_dirty_pages() while it waits for dirty pages to be cleaned.
> I see this in completion:
> 
> kworker/9:1-769 [009] 35.267031: bprint: iomap_finish_ioends: pages 4096, start sector 0x400 size 10123546624 pcnt 4096
> ....
> kworker/9:1-769 [009] 35.982461: bprint: iomap_finish_ioends: pages 164, start sector 0x12db368 size 671744 pcnt 31324
> 
> Yup, that merged into a 10GB ioend that is passed to
> iomap_finish_ioends(), and it took just over 700ms to process the
> entire 10GB ioend.
> 
> If writeback is running at the same time as completion, things are
> a little slower:
> 
> kworker/31:2-793 [031] 51.147943: bprint: iomap_finish_ioends: pages 4096, start sector 0x132ac488 size 8019509248 pcnt 4096
> kworker/u68:13-637 [025] 51.150218: bprint: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x141a4488 bi_vcnt 5, bi_size 16777216
> kworker/31:2-793 [031] 51.152237: bprint: iomap_finish_ioends: pages 4096, start sector 0x132b4488 size 16777216 pcnt 8192
> kworker/u68:13-637 [025] 51.155773: bprint: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x141ac488 bi_vcnt 5, bi_size 16777216
> ....
> kworker/31:2-793 [031] 52.528398: bprint: iomap_finish_ioends: pages 4096, start sector 0x14194488 size 16777216 pcnt 21504
> 
> That's 8GB in 1.4s, but it's still processing over 5GB/s of
> completions.
> 
> This is running with 64GB RAM and:
> 
> $ grep . /proc/sys/vm/dirty*
> /proc/sys/vm/dirty_background_bytes:0
> /proc/sys/vm/dirty_background_ratio:10
> /proc/sys/vm/dirty_bytes:0
> /proc/sys/vm/dirty_expire_centisecs:3000
> /proc/sys/vm/dirty_ratio:20
> /proc/sys/vm/dirtytime_expire_seconds:43200
> /proc/sys/vm/dirty_writeback_centisecs:500
> $
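
A rough aside on those settings (approximating dirtyable memory by the full
64GB, which slightly overstates the real thresholds): dirty_background_ratio=10
starts background writeback at about 0.10 * 64GB = 6.4GB of dirty page cache,
and dirty_ratio=20 throttles writers in balance_dirty_pages() at about 12.8GB.
Raising the ratios to 60/80, as done next, lifts those ceilings to roughly
38GB and 51GB, which is consistent with the much larger merged ioends reported
below.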
> 
> If I change the dirty ratios to 60/80:
> 
> $ grep . /proc/sys/vm/dirty*ratio
> /proc/sys/vm/dirty_background_ratio:60
> /proc/sys/vm/dirty_ratio:80
> $
> 
> I can get up to 15GB merged ioends with a 5 second scheduling delay
> for the xfs_end_io workqueue, but that still only takes about 2s
> of CPU time to process the entire completion:
> 
> kworker/4:2-214 [004] 788.133242: bprint: xfs_end_io: off 0x6f85fe000, sect 0x14daf0c0 size 16777216/0x8000 end 0x14db70c0
> <merges>
> kworker/4:2-214 [004] 788.135393: bprint: iomap_finish_ioends: pages 4096, start sector 0x14daf0c0 size 15837691904 pcnt 4096
> .....
> kworker/4:2-214 [004] 790.083058: bprint: iomap_finish_ioends: pages 4096, start sector 0x16b270c0 size 16777216 pcnt 32768
> 
> Given that I broke physical extent merging completely in the patch
> you were testing, there's no way you would even be getting GB-sized
> completions being run, even with large scheduling delays. There is
> just no way completion is spending that amount of CPU time in a loop
> processing page based writeback completion without triggering some
> case of cond_resched() in the patch I gave you to test, unless there
> is something else happening on those systems.
> 
> So at this point I'm kinda at a loss to understand where the 20+
> second CPU times for completion processing are coming from, even if
> we're trying to process the entire 52GB of dirty pages in a single
> completion.
> 
> Trond, what is the IO completion task actually spending its CPU
> time doing on your machines? Can you trace out what the conditions
> are (ioend lengths, processing time, etc.) when the soft lockups
> occur? Are there so many pending IO completions across different
> files that the completion CPU (CPU #10) is running out of worker
> threads, and/or are the CPU-bound completion worker threads seeing
> tens of seconds of scheduling delay? Is it something completely
> external, like AWS preempting the vCPU that happens to be running IO
> completion for 20+ seconds at a time? Or something else entirely?
> 
> I really need to know what I'm missing here, because it isn't
> obvious from my local systems and it's not obvious just from
> soft-lockup stack traces....
> 
> The latest patch with page based accounting and fixed ioend merging
> I'm running here, including the tracepoints I've been using
> ('trace-cmd record -e printk' is your friend), is below.
> 
> Cheers,
> 
> Dave.

Yesterday I figured out a testing issue that was causing confusion among
the people doing the actual testing. They were seeing hangs, which were
not soft lockups, and which turned out to be artifacts of the testing
methodology.

With this patch, it appears that we are not reproducing the soft lockups.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
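
For completeness, the recurring theme in this thread is that a single merged
ioend can cover many gigabytes of pages, so the completion path needs explicit
rescheduling points (and, ultimately, a cap on how long an ioend chain can
grow). The fragment below is only a sketch of that idea, with invented names
and an arbitrary batch size; it is not the patch referenced above.

/*
 * Sketch only: invented names, arbitrary batch size, not the patch from
 * this thread.  It shows the shape of the mitigation being discussed --
 * yield the CPU periodically while completing a huge merged ioend so the
 * soft-lockup watchdog never sees a kworker pinned for tens of seconds.
 */
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

#define DEMO_COMPLETION_BATCH	4096	/* pages completed between yields */

static void demo_finish_ioend_pages(struct page **pages, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		end_page_writeback(pages[i]);	/* per-page completion work */

		/* Let other work run every DEMO_COMPLETION_BATCH pages. */
		if ((i + 1) % DEMO_COMPLETION_BATCH == 0)
			cond_resched();
	}
}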