On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote: > OK, so I read the code again and you are right that end_page_writeback() > will be called before the work is queued and thus there are no problems > with i_mutex (both with multiblock-IO code and with the old code). Sorry for > confusion. But reading the code I have a few questions: > In multiblock io submission code we call end_page_writeback() and drop > page references before we queue work converting extent from uninitialized > to initialized. But what prevents mm from reclaiming the page (and possibly > loading zeros from disk) before we finish the conversion? We hold a reference on the page (i.e., via get_page()), which we don't drop until we are doing doing the conversion. The workqueue function (ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(), which drops the page reference. > Related question is: Commit > bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of > inode reference when creating/freeing end IO work. But now I don't see what > prevents mm from reaping the inode before the workqueue gets to processing > it. Actually it was commit f7ad6d2e9201a which removed the getting and putting of the inode reference, and it describes the issues involved. Specifically: The following BUG can occur when an inode which is getting freed when it still has dirty pages outstanding, and it gets deleted (in this because it was the target of a rename). In ordered mode, we need to make sure the data pages are written just in case we crash before the rename (or unlink) is committed. If the inode is being freed then when we try to igrab the inode, we end up tripping the BUG_ON at fs/ext4/page-io.c:146. To solve this problem, we need to keep track of the number of io callbacks which are pending, and avoid destroying the inode until they have all been completed. That way we don't have to bump the inode count to keep the inode from being destroyed; an approach which doesn't work because the count could have already been dropped down to zero before the inode writeback has started (at which point we're not allowed to bump the count back up to 1, since it's already started getting freed). Thanks to Dave Chinner for suggesting this approach, which is also used by XFS. > Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the > old IO submission code but apparently it forgot to return the old handling > of uninitialized buffers so we unconditionnaly call block_write_full_page() > without specifying end_io function. Yeah. That was a bug, no question. No one apparently ran into it (or at least no one bothered to report it to linux-ext4 or linux-kernel as far as I know) while it was the default (i.e., during 2.6.37), and as of 2.6.38 we are now using the new IO submission code by default again. Given that no one has reported any problems with the new IO submission code, I was planning on completely removing the nomblk_io_submit mount option for the 3.1 kernel, which would fix the bug by virtue of nuking code in question. :-) > So to return to our original discussion regarding i_mutex - you are right > currently i_mutex does not cause a deadlock but to fix the first of the > above problems we need to somehow pin the page before all metadata is > properly updated, that metadata update requires i_mutex, and we will have > to be careful to pin the page in such a way so that memory reclaim cannot > get stuck reaping that page... If you can solve that, I'll be happy but > currently I'd find it more robust to just call end_page_writeback() after > we convert extents and avoid fs recursion from allocations. So I think the current fs/ext4/page_io.c code solves the problem simply by grabbing a reference on the bug, which should prevent the mm layer from reclaiming the page. Does that satisfy you? Do you see anything that I may have missed? (I very well could have missed something; after all, this code is subtle, which is why I didn't feel comfortable responsible until I could find the time to go and research this in depth; sorry for delay in getting back to you.) - Ted P.S. Upon further reflection, perhaps we need to drop a few more code comments in it so it's clearer to folks who try to go through this code path in the future. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html