Re: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Chris Mason <clm@xxxxxx> · Wed, 1 Jun 2022 14:13:42 +0000

> On Jun 1, 2022, at 8:18 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> 
> This does look sane to me. How much testing did this get?

Almost none at all. I made sure the invalidates were triggering and bashed on it with fsx, but haven’t even run xfstests yet.  The first rule about truncate is that v1 patches are always broken, so I’m expecting explosions.

> Especially
> for the block size < page size case? Also adding Dave as he has spent
> a lot of time on this code.
> 

Sorry Dave, I thought I had you in here already.

> On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
>> iomap_do_writepage() sends pages past i_size through
>> folio_redirty_for_writepage(), which normally isn't a problem because
>> truncate and friends clean them very quickly.
>> 
>> When the system a variety of cgroups,
> 
> ^^^ This sentence does not parse ^^^
> 

Most of production is set up with one cgroup tree for the workloads we love and care about, and a few cgroup trees for everything else.  We tend to crank down memory or IO limits on the unloved cgroups and prioritize the workload cgroups.

This problem is hitting our mysql workloads, which are mostly O_DIRECT on a relatively small number of files.  From a kernel point of view it’s a lot of IO and not much actual resource management.  What’s happening in prod (on an older 5.6 kernel) is the non-mysql cgroups are blowing past the background dirty threshold, which kicks off the async writeback workers.

The actual call path is: wb_workfn()->wb_do_writeback()->wb_check_background_flush()->wb_writeback()->__writeback_inodes_wb()

Johannes explained to me that wb_over_bg_thresh(wb) ends up returning true on the mysql cgroups because the global background limit has been reached, even though mysql didn’t really contribute much of the dirty data.  So we call down into wb_writeback(), which will loop as long as __writeback_inodes_wb() returns that it’s making progress and we’re still globally over the bg threshold.
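
To make the threshold behavior concrete, here is a rough userspace model of that check.  It is only a sketch of the logic as described above, not the kernel code: struct wb_model, its fields, and model_over_bg_thresh() are made-up names for illustration, and the real wb_over_bg_thresh() works on kernel state that isn't modeled here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the background check looks at. */
struct wb_model {
	unsigned long global_dirty;     /* dirty pages system-wide */
	unsigned long global_bg_thresh; /* global background threshold */
	unsigned long wb_dirty;         /* dirty pages charged to this cgroup's wb */
	unsigned long wb_bg_thresh;     /* this wb's own background threshold */
};

/*
 * Assumed shape of the check: the global test comes first, so a wb that
 * has dirtied almost nothing still reports "over threshold" whenever the
 * system as a whole is over the background limit.
 */
bool model_over_bg_thresh(const struct wb_model *m)
{
	if (m->global_dirty > m->global_bg_thresh)
		return true;

	return m->wb_dirty > m->wb_bg_thresh;
}
```

With global_dirty above global_bg_thresh, the model returns true even when wb_dirty is tiny, which is the situation the mysql cgroups end up in.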

In prod, bpftrace showed looping on a single inode inside a mysql cgroup.  That inode was usually in the middle of being deleted, i_size set to zero, but it still had 40-90 pages sitting in the xarray waiting for truncation.  We’d loop through the whole call path above over and over again, mostly because writepages() was reporting that progress had been made on this one inode.  The redirty_page_for_writepage() path does drop wbc->nr_to_write, so the rest of the writepages machinery believes real work is being done.  nr_to_write is LONG_MAX, so we’ve got a while to loop.

I had dreams of posting a trivial reproduction with two cgroups, dd, and a single file being written and truncated in a loop, which works pretty well on 5.6 and refuses to be useful upstream.   Johannes and I talked it over and we still think this patch makes sense, since the redirty path feels suboptimal.  I’ll try to make a better reproduction as well.

To give an idea of how rare this is, I’d run bpftrace for 300 seconds at a time on 10K machines and usually find a single machine in the loop.

-chris