Re: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Chris Mason <clm@xxxxxx> · Fri, 3 Jun 2022 12:09:06 -0400

[ From a different message, Dave asks wtf my email client was doing.
Thanks Dave, apparently Exchange is being exchangey with base64 in
unpredictable ways.  This was better in my test reply, let's see. ]

On 6/3/22 11:06 AM, Johannes Weiner wrote:
> Hello Dave,
>
> On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
>> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
>>> As you describe above, the loops are definitely coming from higher
>>> in the stack.  wb_writeback() will loop as long as
>>> __writeback_inodes_wb() returns that it’s making progress and
>>> we’re still globally over the bg threshold, so write_cache_pages()
>>> is just being called over and over again.  We’re coming from
>>> wb_check_background_flush(), so:
>>>
>>>         struct wb_writeback_work work = {
>>>                 .nr_pages       = LONG_MAX,
>>>                 .sync_mode      = WB_SYNC_NONE,
>>>                 .for_background = 1,
>>>                 .range_cyclic   = 1,
>>>                 .reason         = WB_REASON_BACKGROUND,
>>>         };
>>
>> Sure, but we end up in writeback_sb_inodes() which does this after
>> the __writeback_single_inode()->do_writepages() call that iterates
>> the dirty pages:
>>
>>         if (need_resched()) {
>>                 /*
>>                  * We're trying to balance between building up a nice
>>                  * long list of IOs to improve our merge rate, and
>>                  * getting those IOs out quickly for anyone throttling
>>                  * in balance_dirty_pages().  cond_resched() doesn't
>>                  * unplug, so get our IOs out the door before we
>>                  * give up the CPU.
>>                  */
>>                 blk_flush_plug(current->plug, false);
>>                 cond_resched();
>>         }
>>
>> So if there is a pending IO completion on this CPU on a work queue
>> here, we'll reschedule to it because the work queue kworkers are
>> bound to CPUs and they take priority over user threads.
>
> The flusher thread is also a kworker, though. So it may hit this
> cond_resched(), but it doesn't yield until the timeslice expires.

Just to underline this, the long tail latencies aren't softlockups or 
major explosions.  It's just suboptimal enough that different metrics 
and dashboards noticed it.
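
For anyone who wants to poke at the loop shape described at the top of
this mail, here is a rough userspace toy, not kernel code.  Every name
in it (toy_wb, toy_writeback_pass(), toy_wb_writeback(), max_passes) is
made up for illustration, and it assumes a pass can report progress
without lowering the global dirty count.  It only mirrors the two exit
conditions: stop once we're no longer over the bg threshold, or once a
pass reports no progress.

```c
#include <stdbool.h>

/* Hypothetical toy state; none of these names exist in the kernel. */
struct toy_wb {
	long global_dirty;      /* pages dirty system-wide */
	long bg_thresh;         /* background dirty threshold */
	long cleaned_per_pass;  /* pages a pass really cleans */
	long reported_per_pass; /* pages a pass reports as progress */
};

/* Stand-in for the "globally over the bg threshold" check. */
static bool toy_over_bg_thresh(const struct toy_wb *wb)
{
	return wb->global_dirty > wb->bg_thresh;
}

/*
 * Stand-in for one __writeback_inodes_wb() pass.  Assumption of this
 * toy: the progress a pass reports is independent of how many pages
 * it actually cleaned.
 */
static long toy_writeback_pass(struct toy_wb *wb)
{
	wb->global_dirty -= wb->cleaned_per_pass;
	if (wb->global_dirty < 0)
		wb->global_dirty = 0;
	return wb->reported_per_pass;
}

/*
 * Stand-in for the background writeback loop.  Returns the number of
 * passes made; max_passes is a cap added only so the toy terminates.
 */
static long toy_wb_writeback(struct toy_wb *wb, long max_passes)
{
	long passes = 0;

	while (passes < max_passes) {
		/* Background work stops once we're under the threshold. */
		if (!toy_over_bg_thresh(wb))
			break;

		long progress = toy_writeback_pass(wb);
		passes++;

		/* No progress reported: give up instead of spinning. */
		if (!progress)
			break;
	}
	return passes;
}
```

With cleaned_per_pass = 0 and reported_per_pass = 1 the toy runs until
it hits the cap, which is the "called over and over again" case.  With
cleaned_per_pass = 10, global_dirty = 100 and bg_thresh = 50 it stops
after 5 passes.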

-chris