> On Jun 2, 2022, at 6:06 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner wrote:
>> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
>>> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
>>>> In prod, bpftrace showed looping on a single inode inside a mysql
>>>> cgroup. That inode was usually in the middle of being deleted,
>>>> i_size set to zero, but it still had 40-90 pages sitting in the
>>>> xarray waiting for truncation. We’d loop through the whole call
>>>> path above over and over again, mostly because writepages() was
>>>> returning progress had been made on this one inode. The
>>>> redirty_page_for_writepage() path does drop wbc->nr_to_write, so
>>>> the rest of the writepages machinery believes real work is being
>>>> done. nr_to_write is LONG_MAX, so we’ve got a while to loop.
>>>
>>> Yup, this code relies on truncate making progress to avoid looping
>>> forever. Truncate should only block on the page while it locks it
>>> and waits for writeback to complete, then it gets forcibly
>>> invalidated and removed from the page cache.
>>
>> It's not looping forever, truncate can just take a relatively long
>> time during which the flusher is busy-spinning full bore on a
>> relatively small number of unflushable pages (range_cyclic).
>>
>> But you raise a good point asking "why is truncate stuck?". I first
>> thought they might be cannibalizing each other over the page locks,
>> but that wasn't it (and wouldn't explain the clear asymmetry between
>> truncate and flusher). That leaves the waiting for writeback. I just
>> confirmed with tracing that that's exactly where truncate sits while
>> the flusher goes bananas on the same inode.
>> So the race must be this:
>>
>> truncate:                        flusher:
>>                                  put a subset of pages under writeback
>> i_size_write(0)
>> wait_on_page_writeback()
>>                                  loop with range_cyclic over
>>                                  remaining dirty >EOF pages
>
> But write_cache_pages() doesn't repeatedly loop over the pages.
>
> The flusher is
>
>   ->writepages
>     iomap_writepages
>       write_cache_pages()
>         loop over mapping tree
>           lock page
>           iomap_do_writepage
>             set_page_writeback()
>             add page to ioend
>         <end of mapping reached>
>       iomap_submit_ioend()
>       <pages under writeback get sent for IO>
>   return to high level writeback
>
> And eventually IO completion will clear page writeback state.
>

Yes, this is actually happening before the truncate starts. The
truncate finds these writeback pages and waits for them to finish IO,
and while it’s waiting wb_check_background_flush() goes wild on the
redirty path.

> i.e. write_cache_pages() should not be hard looping over the pages
> beyond EOF even if range_cyclic is set - it's skipping those pages,
> submitting any that are under writeback, and then going back to high
> level code for it to make a decision about continuation of
> writeback. It may call back down and we loop over dirty pages beyond
> EOF again, but the flusher should not be holding on to pages under
> writeback for any significant length of time before they are
> submitted for IO.
>

I spent a while trying to blame write_cache_pages() for looping
repeatedly, and ended up making a series of bpftrace scripts that
collected call stack frequency counters for all the different ways to
wander into write_cache_pages(). I eventually ran it across 100K
systems to figure out exactly how we were getting into trouble. It was
(thankfully) really consistent. As you describe above, the loops are
definitely coming from higher in the stack.
wb_writeback() will loop as long as __writeback_inodes_wb() returns
that it’s making progress and we’re still globally over the bg
threshold, so write_cache_pages() is just being called over and over
again. We’re coming from wb_check_background_flush(), so:

	struct wb_writeback_work work = {
		.nr_pages	= LONG_MAX,
		.sync_mode	= WB_SYNC_NONE,
		.for_background	= 1,
		.range_cyclic	= 1,
		.reason		= WB_REASON_BACKGROUND,
	};

> IOWs, if truncate is getting stuck waiting on writeback, then that
> implies something is holding up IO completions for a long time,

From Johannes’s tracing today, that’s about 17us. Our victim cgroup
has just a handful of files, so we can burn through a lot of
write_cache_pages() loops in the time truncate is waiting for a single
IO on a relatively fast ssd.

> not that there's a problem in writeback submission. i.e. you might
> actually be looking at a workqueue backlog

I actually think the workqueue backlog is coming from the flusher
thread hogging the CPU. The investigation started by looking for long
tail latencies in write() system calls, and Domas’s original finding
was that IO completion workers were waiting for CPUs. That’s how he
ended up finding the redirty calls using high levels of CPU.

We honestly won’t be sure until we live patch a lot of boxes and look
for long tail latency improvements, but I’m pretty optimistic.

-chris