On Tue 06-09-11 11:36:01, Ted Tso wrote: > On Tue, Sep 06, 2011 at 12:02:53PM +0200, Jan Kara wrote: > > That would seem possible as I'm looking e.g. into > > ext4_convert_unwritten_extents(). But then any waiting for writeback from > > kjournald is prone to deadlocks. In fact regardless of what kjournald does > > there does not seem to be sane lock ordering since we have: > > > > transaction start -> PageWriteback -> transaction start > > ^ ^ > > | end_io handling if I'm right > > enforced by write_cache_pages_da > > > > Which is really nasty. We cannot end page writeback until the page is > > readable from disk which means until we have properly updated extent tree. > > But for extent tree update we need a transaction. The only way out I can > > see is to reserve space for extent tree update in a transaction in > > writepages() and holding the transaction open until end_io time when we > > change extent tree and close the transaction. But I'm afraid that's going > > to suck under heavy load... > > Yes, that's a problem with ext4_convert_unwritten_extents() being > called out of end_io handling, and of course dioread_nolock does a lot > more of that, hence why it shows up in that mode. > > I think the long-term solution here is that we have to reserve space > and make the allocation decision at writepages() time, but we don't > actually modify any on-disk state, hence we don't have to hold a > transaction open. We just prevent those disk blocks from getting > allocated anywhere else, and we tentative assoicate physical blocks > with the logical block numbers, but in a memory structure only. Then > when the pages are written, we can drop PageWriteback. We don't have > to wait until the extent blocks are written to disk, so long as any > callers of ext4_map_blocks() get the right information (via an > in-memory cache that we would have to add to make this whole thing > first), and so long as fsync() not only calls filemap_fdatawrite(), > but also waits for the metadata updates to the extent tree/indirect > blocks have been completed. I have originally disregarded this option because it seemed to fragile. But as you outline it here, it looks it should be doable. Honza -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html