On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> > Hi,
> >
> > I'm trying to understand a failure I'm seeing with both v4.14 and
> > v4.19 based kernels, and I was hoping you could point me in the
> > right direction.
> >
> > What seems to be happening is that under heavy I/O we get into a
> > situation where for a given inode/mapping we eventually reach a
> > steady state where one task is continuously dirtying pages and
> > marking them for writeback via ext4_writepages(), and another task
> > is continuously completing I/Os via ext4_end_bio() and clearing the
> > PAGECACHE_TAG_WRITEBACK flags. So, we are making forward progress
> > as far as I/O is concerned.
> >
> > The problem is that another task calls filemap_fdatawait_range(),
> > and that call never returns because it always finds pages that are
> > tagged for writeback. I've added some prints to
> > __filemap_fdatawait_range(), and the total number of pages tagged
> > for writeback seems pretty constant. It goes up and down a bit,
> > but does not seem to move towards 0. If we halt I/O the system
> > eventually recovers, but if we keep I/O going we can block the
> > task waiting in __filemap_fdatawait_range() long enough for the
> > system to reboot due to what it perceives as a hung task.
> >
> > My question is: is there some mechanism that is supposed to
> > prevent this sort of situation? Or is it expected that with slow
> > enough storage and a high enough I/O load, we could block inside
> > of filemap_fdatawait_range() indefinitely, since we never run out
> > of dirty pages that are marked for writeback?
>
> So your problem is that you are doing an extending write, and then
> doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
> blocks on the pages under IO, the file is further extended and so
> the next radix tree lookup finds more pages past that page under
> writeback?
>
> i.e. because it is waiting for pages to complete, it never gets
> ahead of the extending write or writeback and always ends up with
> more pages to wait on, and so never reaches the end of the file as
> directed?
>
> So perhaps the caller should be waiting on a specific range to bound
> the wait (e.g. isize as the end of the wait) rather than using the
> default "keep going until the end of file is reached" semantics?

The call to __filemap_fdatawait_range() is happening via the jbd2 code:

jbd2_journal_commit_transaction()
  journal_finish_inode_data_buffers()
    filemap_fdatawait_keep_errors()
      __filemap_fdatawait_range(end = LLONG_MAX)

Would it have to be an extending write? Or could the same thing happen
if one thread is just moving forward through a very large file,
dirtying pages, so that the __filemap_fdatawait_range() call keeps
finding new pages as it moves forward through the big file?

In either case, I think your description of the problem is correct.
Is this just a "well, don't do that" type situation, or is it supposed
to have a different result?

- Ross
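
P.S. For reference, the loop I instrumented in
__filemap_fdatawait_range() looks roughly like this in v4.19
(simplified, error checks omitted). Each tagged lookup can return
pages that were marked for writeback after the previous batch was
waited on, which is why a steady dirtier keeps the loop from ever
terminating:

	pgoff_t index = start_byte >> PAGE_SHIFT;
	pgoff_t end = end_byte >> PAGE_SHIFT;	/* huge when end_byte = LLONG_MAX */
	struct pagevec pvec;

	pagevec_init(&pvec);
	while (index <= end) {
		unsigned i, nr_pages;

		/* Can find pages tagged *after* the previous batch was
		 * waited on, so the loop chases the ongoing writer. */
		nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index,
				end, PAGECACHE_TAG_WRITEBACK);
		if (!nr_pages)
			break;

		for (i = 0; i < nr_pages; i++) {
			wait_on_page_writeback(pvec.pages[i]);
			ClearPageError(pvec.pages[i]);
		}
		pagevec_release(&pvec);
		cond_resched();
	}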
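
And to make sure I understand the bounded wait you're suggesting:
something roughly like the following in
journal_finish_inode_data_buffers(), where we snapshot i_size once and
only wait on that byte range, so pages dirtied past the snapshot can't
keep extending the wait? (Just a sketch, assuming the jinode naming
used in that function; it also glosses over the keep_errors semantics
that jbd2 currently relies on.)

	struct inode *inode = jinode->i_vfs_inode;
	loff_t isize = i_size_read(inode);	/* snapshot before waiting */

	/* Wait on [0, isize - 1] instead of [0, LLONG_MAX], so pages
	 * dirtied beyond the snapshot don't extend the wait forever. */
	if (isize)
		err = filemap_fdatawait_range(inode->i_mapping, 0, isize - 1);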