On Mon, Mar 18, 2019 at 5:38 AM Jan Kara <jack@xxxxxxx> wrote:
> On Thu 14-03-19 14:37:55, Ross Zwisler wrote:
> > On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> > > > Hi,
> > > >
> > > > I'm trying to understand a failure I'm seeing with both v4.14 and
> > > > v4.19 based kernels, and I was hoping you could point me in the
> > > > right direction.
> > > >
> > > > What seems to be happening is that under heavy I/O we get into a
> > > > situation where for a given inode/mapping we eventually reach a
> > > > steady state where one task is continuously dirtying pages and
> > > > marking them for writeback via ext4_writepages(), and another task
> > > > is continuously completing I/Os via ext4_end_bio() and clearing
> > > > the PAGECACHE_TAG_WRITEBACK flags. So, we are making forward
> > > > progress as far as I/O is concerned.
> > > >
> > > > The problem is that another task calls filemap_fdatawait_range(),
> > > > and that call never returns because it always finds pages that are
> > > > tagged for writeback. I've added some prints to
> > > > __filemap_fdatawait_range(), and the total number of pages tagged
> > > > for writeback seems pretty constant. It goes up and down a bit,
> > > > but does not seem to move towards 0. If we halt I/O the system
> > > > eventually recovers, but if we keep I/O going we can block the
> > > > task waiting in __filemap_fdatawait_range() long enough for the
> > > > system to reboot due to what it perceives as a hung task.
> > > >
> > > > My question is: is there some mechanism that is supposed to
> > > > prevent this sort of situation? Or is it expected that with slow
> > > > enough storage and a high enough I/O load, we could block inside
> > > > of filemap_fdatawait_range() indefinitely, since we never run out
> > > > of dirty pages that are marked for writeback?
> > >
> > > So your problem is that you are doing an extending write, and then
> > > doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
> > > blocks on the pages under IO, the file is further extended and so
> > > the next radix tree lookup finds more pages past that page under
> > > writeback?
> > >
> > > i.e. because it is waiting for pages to complete, it never gets
> > > ahead of the extending write or writeback, always ends up with
> > > more pages to wait on, and so never reaches the end of the file as
> > > directed?
> > >
> > > So perhaps the caller should be waiting on a specific range to
> > > bound the wait (e.g. i_size as the end of the wait) rather than
> > > using the default "keep going until the end of file is reached"
> > > semantics?
> >
> > The call to __filemap_fdatawait_range() is happening via the jbd2
> > code:
> >
> > jbd2_journal_commit_transaction()
> >   journal_finish_inode_data_buffers()
> >     filemap_fdatawait_keep_errors()
> >       __filemap_fdatawait_range(end = LLONG_MAX)
> >
> > Would it have to be an extending write? Or could it work the same
> > way if you have one thread just moving forward through a very large
> > file, dirtying pages, so that the __filemap_fdatawait_range() call
> > just keeps finding new pages as it moves forward through the big
> > file?
>
> As Ted wrote, it must be an extending write or a very large file.
> __filemap_fdatawait_range() is strictly monotone - it waits for each
> page at most once (check the loop in __filemap_fdatawait_range()). It
> would actually be good to know which case you hit, if you can find it
> out.
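I'll try to figure out which case we're hitting. For reference, here's
the loop in question, lightly abbreviated from v4.19's mm/filemap.c
(quoting from memory, so treat the details as approximate):

	static void __filemap_fdatawait_range(struct address_space *mapping,
					      loff_t start_byte, loff_t end_byte)
	{
		pgoff_t index = start_byte >> PAGE_SHIFT;
		pgoff_t end = end_byte >> PAGE_SHIFT;
		struct pagevec pvec;
		int nr_pages;

		pagevec_init(&pvec);
		while (index <= end) {
			unsigned i;

			/* The lookup advances 'index' past each page it
			 * returns, so no page is waited on twice. */
			nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
					&index, end, PAGECACHE_TAG_WRITEBACK);
			if (!nr_pages)
				break;

			for (i = 0; i < nr_pages; i++) {
				struct page *page = pvec.pages[i];

				/* Sleep until writeback on this page ends;
				 * meanwhile the writer can tag new pages
				 * past 'index'. */
				wait_on_page_writeback(page);
				ClearPageError(page);
			}
			pagevec_release(&pvec);
			cond_resched();
		}
	}

So the walk is indeed monotone, but with end == LLONG_MAX it has no
useful upper bound: while we sleep in wait_on_page_writeback(), the
writer tags new pages ahead of 'index', and the next lookup finds them.
That matches what my debug prints are showing.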
> > In either case, I think your description of the problem is correct.
> > Is this just a "well, don't do that" type situation, or is this
> > supposed to have a different result?
>
> Let's call this a known limitation of the current ext4 journalling
> implementation :) As Ted has outlined, there are plans to redesign
> some things which would also avoid this problem, but that's not a
> quick fix. Short term we could reduce the problem by tracking in jbd2
> the min-max range that's relevant for the running transaction. It
> wouldn't completely fix it, since the problem would still trigger for
> e.g. random writes into a sparse file, but that is far less common
> than a continuously extending file or a sequential write into a large
> file.

Awesome, thank you for the replies. I'll see if I can boil it down to a
relatively simple xfstest-type reproducer, and I'll take a crack at
implementing your suggested workaround in jbd2.

Thanks,
- Ross
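P.S. Here's roughly what I have in mind for the range tracking, as a
completely untested sketch - the field and function names below are
invented, and locking against transaction commit is elided:

	/*
	 * Track the byte range dirtied under the running transaction so
	 * that commit only has to wait on [i_dirty_start, i_dirty_end]
	 * rather than [0, LLONG_MAX].
	 */
	struct jbd2_inode {
		/* ... existing fields ... */
		loff_t i_dirty_start;	/* lowest offset dirtied this transaction */
		loff_t i_dirty_end;	/* highest offset dirtied this transaction */
	};

	/*
	 * ext4 would call this from the paths that dirty page cache
	 * under the running handle (buffered write, page_mkwrite, etc.).
	 * i_dirty_start would be reset to LLONG_MAX (and i_dirty_end to
	 * 0) whenever the inode is attached to a new transaction.
	 */
	static void jbd2_inode_add_dirty_range(struct jbd2_inode *jinode,
					       loff_t start, loff_t end)
	{
		jinode->i_dirty_start = min(jinode->i_dirty_start, start);
		jinode->i_dirty_end = max(jinode->i_dirty_end, end);
	}

journal_finish_inode_data_buffers() would then wait on just that window,
via a ranged variant of filemap_fdatawait_keep_errors() that would need
to be added. Does that match what you were suggesting?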