Re: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Johannes Weiner <hannes@xxxxxxxxxxx> · Fri, 3 Jun 2022 11:06:22 -0400

Hello Dave,

On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > As you describe above, the loops are definitely coming from higher
> > in the stack.  wb_writeback() will loop as long as
> > __writeback_inodes_wb() returns that it’s making progress and
> > we’re still globally over the bg threshold, so write_cache_pages()
> > is just being called over and over again.  We’re coming from
> > wb_check_background_flush(), so:
> > 
> >                 struct wb_writeback_work work = {
> >                         .nr_pages       = LONG_MAX,
> >                         .sync_mode      = WB_SYNC_NONE,
> >                         .for_background = 1,
> >                         .range_cyclic   = 1,
> >                         .reason         = WB_REASON_BACKGROUND,
> >                 };
> 
> Sure, but we end up in writeback_sb_inodes() which does this after
> the __writeback_single_inode()->do_writepages() call that iterates
> the dirty pages:
> 
>                 if (need_resched()) {
>                         /*
>                          * We're trying to balance between building up a nice
>                          * long list of IOs to improve our merge rate, and
>                          * getting those IOs out quickly for anyone throttling
>                          * in balance_dirty_pages().  cond_resched() doesn't
>                          * unplug, so get our IOs out the door before we
>                          * give up the CPU.
>                          */
>                         blk_flush_plug(current->plug, false);
>                         cond_resched();
>                 }
> 
> So if there is a pending IO completion on this CPU on a work queue
> here, we'll reschedule to it because the work queue kworkers are
> bound to CPUs and they take priority over user threads.

The flusher thread is also a kworker, though. So it may hit this
cond_resched(), but it doesn't yield until the timeslice expires.

> Also, this then requeues the inode on the b_more_io queue, and
> wb_check_background_flush() won't come back to it until all other
> inodes on all other superblocks on the bdi have had writeback
> attempted. So if the system truly is over the background dirty
> threshold, why is writeback getting stuck on this one inode in this
> way?

The explanation for this part at least is that the bdi/flush domain is
split per cgroup. The cgroup in question is over its proportional
background threshold. It has very few dirty pages, but it also has very
few *dirtyable* pages, which makes for a high dirty ratio. And that
handful of dirty pages are the unflushable ones past EOF.

There is no next inode to move on to in subsequent loops, so writeback
keeps coming back to this one.
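
To make the ratio argument concrete, here is a rough sketch of the
arithmetic. It is a simplified model, not the kernel's actual
per-cgroup threshold code: the helper names are made up for
illustration, and the real calculation is more involved. It only
shows how a few dirty pages can exceed a threshold that is
proportional to a small dirtyable set.

```c
#include <stdbool.h>

/*
 * Illustrative only: hypothetical helpers, not kernel functions.
 * The background threshold is modelled as a plain percentage of the
 * pages the cgroup could dirty at all.
 */
static unsigned long bg_thresh_pages(unsigned long dirtyable,
				     unsigned int bg_ratio_pct)
{
	return dirtyable * bg_ratio_pct / 100;
}

/* True when the cgroup's dirty page count exceeds its threshold. */
static bool over_bg_thresh(unsigned long dirty, unsigned long dirtyable,
			   unsigned int bg_ratio_pct)
{
	return dirty > bg_thresh_pages(dirtyable, bg_ratio_pct);
}
```

With made-up numbers: at a 10% background ratio, a cgroup with 40
dirtyable pages gets a threshold of 4 pages, so 8 dirty pages past EOF
keep it over the threshold. The same 8 pages against a million
dirtyable pages would not register at all.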