On Tue, Sep 06, 2011 at 12:02:53PM +0200, Jan Kara wrote: > Umm, I don't quite understand the race. It seems to me we can block on > lack of journal space during writeback only in ext4_da_writepages() where > we do ext4_journal_start(). But at that point there shouldn't be any pages > with PageWriteback set which were not submitted for IO. So it's not clear > to me why PageWriteback bit does not get cleared when IO finishes and > kjournald would then continue. Is it because IO completion is somehow > blocked on journal? No, you're right. I looked more closely at it and the problem is because of the end_io handling. There was a thead blocked in mpage_da_map_and_submit path but it was blocked in ext4_get_wite_access(), and it was not holding PageWriteback at the time. > That would seem possible as I'm looking e.g. into > ext4_convert_unwritten_extents(). But then any waiting for writeback from > kjournald is prone to deadlocks. In fact regardless of what kjournald does > there does not seem to be sane lock ordering since we have: > > transaction start -> PageWriteback -> transaction start > ^ ^ > | end_io handling if I'm right > enforced by write_cache_pages_da > > Which is really nasty. We cannot end page writeback until the page is > readable from disk which means until we have properly updated extent tree. > But for extent tree update we need a transaction. The only way out I can > see is to reserve space for extent tree update in a transaction in > writepages() and holding the transaction open until end_io time when we > change extent tree and close the transaction. But I'm afraid that's going > to suck under heavy load... Yes, that's a problem with ext4_convert_unwritten_extents() being called out of end_io handling, and of course dioread_nolock does a lot more of that, hence why it shows up in that mode. I think the long-term solution here is that we have to reserve space and make the allocation decision at writepages() time, but we don't actually modify any on-disk state, hence we don't have to hold a transaction open. We just prevent those disk blocks from getting allocated anywhere else, and we tentative assoicate physical blocks with the logical block numbers, but in a memory structure only. Then when the pages are written, we can drop PageWriteback. We don't have to wait until the extent blocks are written to disk, so long as any callers of ext4_map_blocks() get the right information (via an in-memory cache that we would have to add to make this whole thing first), and so long as fsync() not only calls filemap_fdatawrite(), but also waits for the metadata updates to the extent tree/indirect blocks have been completed. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html