On Tue, Sep 6, 2011 at 11:36 PM, Ted Ts'o <tytso@xxxxxxx> wrote: > On Tue, Sep 06, 2011 at 12:02:53PM +0200, Jan Kara wrote: >> Umm, I don't quite understand the race. It seems to me we can block on >> lack of journal space during writeback only in ext4_da_writepages() where >> we do ext4_journal_start(). But at that point there shouldn't be any pages >> with PageWriteback set which were not submitted for IO. So it's not clear >> to me why PageWriteback bit does not get cleared when IO finishes and >> kjournald would then continue. Is it because IO completion is somehow >> blocked on journal? > > No, you're right. I looked more closely at it and the problem is > because of the end_io handling. There was a thead blocked in > mpage_da_map_and_submit path but it was blocked in > ext4_get_wite_access(), and it was not holding PageWriteback at the > time. > >> That would seem possible as I'm looking e.g. into >> ext4_convert_unwritten_extents(). But then any waiting for writeback from >> kjournald is prone to deadlocks. In fact regardless of what kjournald does >> there does not seem to be sane lock ordering since we have: >> >> transaction start -> PageWriteback -> transaction start >> ^ ^ >> | end_io handling if I'm right >> enforced by write_cache_pages_da >> >> Which is really nasty. We cannot end page writeback until the page is >> readable from disk which means until we have properly updated extent tree. >> But for extent tree update we need a transaction. The only way out I can >> see is to reserve space for extent tree update in a transaction in >> writepages() and holding the transaction open until end_io time when we >> change extent tree and close the transaction. But I'm afraid that's going >> to suck under heavy load... > > Yes, that's a problem with ext4_convert_unwritten_extents() being > called out of end_io handling, and of course dioread_nolock does a lot > more of that, hence why it shows up in that mode. > > I think the long-term solution here is that we have to reserve space > and make the allocation decision at writepages() time, but we don't > actually modify any on-disk state, hence we don't have to hold a > transaction open. We just prevent those disk blocks from getting > allocated anywhere else, and we tentative assoicate physical blocks > with the logical block numbers, but in a memory structure only. Then > when the pages are written, we can drop PageWriteback. We don't have > to wait until the extent blocks are written to disk, so long as any > callers of ext4_map_blocks() get the right information (via an > in-memory cache that we would have to add to make this whole thing > first), and so long as fsync() not only calls filemap_fdatawrite(), > but also waits for the metadata updates to the extent tree/indirect > blocks have been completed. Hi Ted, It seems that this can be implemented on delayed extent tree(RB tree) easily, we just allocate blocks from buddy allocator and associate them to logical blocks with delayed extent. fsync inserts delayed extents into extent tree. BTW: I have sent out patch series implementing delayed extent list, which will use RB tree later. The patch series pass all xfstests except 74, there was a deadlock there, I could not find it out. What's your opinion? Yongqiang. > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Best Wishes Yongqiang Yang -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html