On Mon, Mar 18, 2019 at 5:38 AM Jan Kara <jack@xxxxxxx> wrote:
> On Thu 14-03-19 14:37:55, Ross Zwisler wrote:
> > On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> > > > Hi,
> > > >
> > > > I'm trying to understand a failure I'm seeing with both v4.14 and
> > > > v4.19 based kernels, and I was hoping you could point me in the
> > > > right direction.
> > > >
> > > > What seems to be happening is that under heavy I/O we get into a
> > > > situation where for a given inode/mapping we eventually reach a
> > > > steady state where one task is continuously dirtying pages and
> > > > marking them for writeback via ext4_writepages(), and another task
> > > > is continuously completing I/Os via ext4_end_bio() and clearing
> > > > the PAGECACHE_TAG_WRITEBACK flags. So, we are making forward
> > > > progress as far as I/O is concerned.
> > > >
> > > > The problem is that another task calls filemap_fdatawait_range(),
> > > > and that call never returns because it always finds pages that are
> > > > tagged for writeback. I've added some prints to
> > > > __filemap_fdatawait_range(), and the total number of pages tagged
> > > > for writeback seems pretty constant. It goes up and down a bit,
> > > > but does not seem to move towards 0. If we halt I/O the system
> > > > eventually recovers, but if we keep I/O going we can block the
> > > > task waiting in __filemap_fdatawait_range() long enough for the
> > > > system to reboot due to what it perceives as a hung task.
> > > >
> > > > My question is: is there some mechanism that is supposed to
> > > > prevent this sort of situation? Or is it expected that with slow
> > > > enough storage and a high enough I/O load, we could block inside
> > > > of filemap_fdatawait_range() indefinitely, since we never run out
> > > > of dirty pages that are marked for writeback?
> > >
> > > So your problem is that you are doing an extending write, and then
> > > doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
> > > blocks on the pages under IO, the file is further extended and so
> > > the next radix tree lookup finds more pages past that page under
> > > writeback?
> > >
> > > i.e. because it is waiting for pages to complete, it never gets
> > > ahead of the extending write or writeback, always ends up with
> > > more pages to wait on, and so never reaches the end of the file as
> > > directed?
> > >
> > > So perhaps the caller should be waiting on a specific range to
> > > bound the wait (e.g. i_size as the end of the wait) rather than
> > > using the default "keep going until the end of file is reached"
> > > semantics?
> >
> > The call to __filemap_fdatawait_range() is happening via the jbd2
> > code:
> >
> > jbd2_journal_commit_transaction()
> >   journal_finish_inode_data_buffers()
> >     filemap_fdatawait_keep_errors()
> >       __filemap_fdatawait_range(end = LLONG_MAX)
> >
> > Would it have to be an extending write? Or could it work the same
> > way if you have one thread just moving forward through a very large
> > file, dirtying pages, so that the __filemap_fdatawait_range() call
> > just keeps finding new pages as it moves forward through the big
> > file?
>
> As Ted wrote, it must be an extending write or a very large file.
> __filemap_fdatawait_range() is strictly monotone - it waits for each
> page at most once (check the loop in __filemap_fdatawait_range()). It
> would actually be good to know which case you hit, if you can find it
> out.
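I'll try to figure out which case we're hitting. For reference, here's
the loop in question, lightly abbreviated from v4.19's mm/filemap.c
(quoting from memory, so treat the details as approximate):

	static void __filemap_fdatawait_range(struct address_space *mapping,
					      loff_t start_byte, loff_t end_byte)
	{
		pgoff_t index = start_byte >> PAGE_SHIFT;
		pgoff_t end = end_byte >> PAGE_SHIFT;
		struct pagevec pvec;
		int nr_pages;

		pagevec_init(&pvec);
		while (index <= end) {
			unsigned i;

			/* The lookup advances 'index' past each page it
			 * returns, so no page is waited on twice. */
			nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
					&index, end, PAGECACHE_TAG_WRITEBACK);
			if (!nr_pages)
				break;

			for (i = 0; i < nr_pages; i++) {
				struct page *page = pvec.pages[i];

				/* Sleep until writeback on this page ends;
				 * meanwhile the writer can tag new pages
				 * past 'index'. */
				wait_on_page_writeback(page);
				ClearPageError(page);
			}
			pagevec_release(&pvec);
			cond_resched();
		}
	}

So the walk is indeed monotone, but with end == LLONG_MAX it has no
useful upper bound: while we sleep in wait_on_page_writeback(), the
writer tags new pages ahead of 'index', and the next lookup finds them.
That matches what my debug prints are showing.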
> > In either case, I think your description of the problem is correct.
> > Is this just a "well, don't do that" type situation, or is this
> > supposed to have a different result?
>
> Let's call this a known limitation of the current ext4 journalling
> implementation :) As Ted has outlined, there are plans to redesign
> some things which would also avoid this problem, but that's not a
> quick fix. Short term we could reduce the problem by tracking in jbd2
> the min-max range that's relevant for the running transaction. It
> wouldn't completely fix it, since the problem would still trigger for
> e.g. random writes into a sparse file, but that is far less common
> than a continuously extending file or a sequential write into a large
> file.

Awesome, thank you for the replies. I'll see if I can boil it down to a
relatively simple xfstest-type reproducer, and I'll take a crack at
implementing your suggested workaround in jbd2.

Thanks,
- Ross
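P.S. Here's roughly what I have in mind for the range tracking, as a
completely untested sketch - the field and function names below are
invented, and locking against transaction commit is elided:

	/*
	 * Track the byte range dirtied under the running transaction so
	 * that commit only has to wait on [i_dirty_start, i_dirty_end]
	 * rather than [0, LLONG_MAX].
	 */
	struct jbd2_inode {
		/* ... existing fields ... */
		loff_t i_dirty_start;	/* lowest offset dirtied this transaction */
		loff_t i_dirty_end;	/* highest offset dirtied this transaction */
	};

	/*
	 * ext4 would call this from the paths that dirty page cache
	 * under the running handle (buffered write, page_mkwrite, etc.).
	 * i_dirty_start would be reset to LLONG_MAX (and i_dirty_end to
	 * 0) whenever the inode is attached to a new transaction.
	 */
	static void jbd2_inode_add_dirty_range(struct jbd2_inode *jinode,
					       loff_t start, loff_t end)
	{
		jinode->i_dirty_start = min(jinode->i_dirty_start, start);
		jinode->i_dirty_end = max(jinode->i_dirty_end, end);
	}

journal_finish_inode_data_buffers() would then wait on just that window,
via a ranged variant of filemap_fdatawait_keep_errors() that would need
to be added. Does that match what you were suggesting?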