Hello Dave,

On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > As you describe above, the loops are definitely coming from higher
> > in the stack. wb_writeback() will loop as long as
> > __writeback_inodes_wb() returns that it’s making progress and
> > we’re still globally over the bg threshold, so write_cache_pages()
> > is just being called over and over again. We’re coming from
> > wb_check_background_flush(), so:
> >
> >         struct wb_writeback_work work = {
> >                 .nr_pages       = LONG_MAX,
> >                 .sync_mode      = WB_SYNC_NONE,
> >                 .for_background = 1,
> >                 .range_cyclic   = 1,
> >                 .reason         = WB_REASON_BACKGROUND,
> >         };
>
> Sure, but we end up in writeback_sb_inodes() which does this after
> the __writeback_single_inode()->do_writepages() call that iterates
> the dirty pages:
>
>         if (need_resched()) {
>                 /*
>                  * We're trying to balance between building up a nice
>                  * long list of IOs to improve our merge rate, and
>                  * getting those IOs out quickly for anyone throttling
>                  * in balance_dirty_pages().  cond_resched() doesn't
>                  * unplug, so get our IOs out the door before we
>                  * give up the CPU.
>                  */
>                 blk_flush_plug(current->plug, false);
>                 cond_resched();
>         }
>
> So if there is a pending IO completion on this CPU on a work queue
> here, we'll reschedule to it because the work queue kworkers are
> bound to CPUs and they take priority over user threads.

The flusher thread is also a kworker, though. So it may hit this
cond_resched(), but it doesn't yield until the timeslice expires.

> Also, this then requeues the inode of the b_more_io queue, and
> wb_check_background_flush() won't come back to it until all other
> inodes on all other superblocks on the bdi have had writeback
> attempted. So if the system truly is over the background dirty
> threshold, why is writeback getting stuck on this one inode in this
> way?

The explanation for this part at least is that the bdi/flush domain is
split per cgroup.

The cgroup in question is over its proportional bg thresh. It has very
few dirty pages, but it also has very few *dirtyable* pages, which
makes for a high dirty ratio. And those handful of dirty pages are the
unflushable ones past EOF. There is no next inode to move onto on
subsequent loops.
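
To make the arithmetic concrete, here is a rough userspace sketch of
that situation. It is not kernel code: struct flush_domain, bg_thresh(),
over_bg_thresh() and background_passes() are hypothetical names, the
threshold is simplified to a flat percentage of dirtyable pages, and the
model assumes every pass is treated as progress, as described upthread.
The point is only that a domain with very few dirtyable pages can sit
over its background threshold with a handful of dirty pages, and that
if none of them can be flushed, nothing ever brings it back under.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one flush domain (global or per-cgroup). */
struct flush_domain {
	unsigned long dirtyable;	/* pages that could be dirtied here */
	unsigned long dirty;		/* pages currently dirty */
	unsigned long flushable;	/* dirty pages writeback can clean */
};

/* Simplified: threshold is a flat percentage of dirtyable pages. */
static unsigned long bg_thresh(const struct flush_domain *d,
			       unsigned int bg_ratio_pct)
{
	return d->dirtyable * bg_ratio_pct / 100;
}

static bool over_bg_thresh(const struct flush_domain *d,
			   unsigned int bg_ratio_pct)
{
	return d->dirty > bg_thresh(d, bg_ratio_pct);
}

/*
 * Model of the background loop: keep going while the domain is over
 * its threshold. Each pass cleans whatever is flushable. max_passes
 * only exists so the model terminates; it returns the pass count.
 */
static unsigned int background_passes(struct flush_domain *d,
				      unsigned int bg_ratio_pct,
				      unsigned int max_passes)
{
	unsigned int passes = 0;

	assert(d->flushable <= d->dirty);

	while (over_bg_thresh(d, bg_ratio_pct) && passes < max_passes) {
		d->dirty -= d->flushable;
		d->flushable = 0;
		passes++;
	}
	return passes;
}
```

With made-up numbers: 8 dirty pages against 1000000 dirtyable pages is
nowhere near a 10% threshold, but the same 8 pages against 50 dirtyable
pages is over it (the threshold works out to 5). With flushable == 0 the
loop in the sketch only stops because of max_passes.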