Hi, On Tue 05-08-08 14:03:14, Mingming Cao wrote: > 在 2008-08-04一的 20:10 +0900,Hisashi Hifumi写道: > > Hi > > > > Dio write returns EIO when try_to_release_page fails because bh is > > still referenced. > > The patch > > "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 > > Author: Mingming Cao <cmm@xxxxxxxxxx> > > Date: Fri Jul 25 01:46:22 2008 -0700 > > > > jbd: fix race between free buffer and commit transaction > > " > > was merged into 2.6.27-rc1, but I noticed that this patch is not enough > > to fix the race. > > I did fsstress test heavily to 2.6.27-rc1, and found that dio write still > > sometimes got EIO through this test. > > :( thought we beat that race pretty hard already.T > > Could you send me the fsstree command to reproduce the race? It is a part of ext3-tools package Andrew has somewhere and also LTP has it I think. > > The patch above fixed race between freeing buffer(dio) and committing > > transaction(jbd) but I discovered that there is another race, > > freeing buffer(dio) and ext3/4_ordered_writepage. > > : background_writeout() > > ->write_cache_pages() > > ->ext3_ordered_writepage() > > walk_page_buffers() <- take a bh ref > > block_write_full_page() <- unlock_page > > : <- end_page_writeback > > : <- race! (dio write->try_to_release_page fails) > > walk_page_buffers() <-release a bh ref > > > > ext3_ordered_writepage holds bh ref and does unlock_page remaining > > taking a bh ref, so this causes the race and failure of > > try_to_release_page. > > > > I thought about this before, the race seems unlikely to me. Perhaps I > missed something, but DIO code already waiting for all the pending IO to > finish before calling try_to_release_page which eventually called > journal_try_to_free_buffers(). During this call, the inode mutx is hold > to prevent the new writer (buffered/DIO) to re-dirty the pages. If there > is background writeout happens when DIO is kicked in, DIO will wait for > all the pages writeback bit clear first. here is the stack Yes, but in principle, nothing assures that writeback of buffers does not finish even before block_write_full_page() returns. So there is possibly a window after PageWriteback is cleared (and thus filemap_fdatawait() finishes) and before buffer references are dropped. Now what is more likely in practice is that all buffers of the page are written during transaction commit. So they are clean but the page remains dirty. Now background writeback happens, sees dirty page, calls: ext3_ordered_writepage() block_write_full_page() - finds all buffers are clean -> end_page_writeback() - and at this point direct IO happens which happily proceeds upto a point where try_to_release_page() fails because ext3_ordered_writepage() has not yet dropped its references to buffers. Nasty. So we really need some nice and effective way in which ->writepage() calls (and possibly others) could synchronize with try_to_release_page() (which has __GFP_WAIT and __GFP_FS and is thus willing to wait a bit). But I don't have a good candidate... Honza > generic_file_aio_write() > -> mutex_lock(&inode->i_mutex); > -> __generic_file_aio_write_nolock() > -> generic_file_direct_IO() > ->filemap_write_and_wait() > -> filemap_fdatawait() > -> wait_on_page_writeback_range() > (==== waiting for > pending IO to finish ====) > ->invalidate_inode_pages2_range() > ->invalidate_inode_pages2() > ->try_to_releasepage() > ->ext3_releasepage() > ->journal_try_to_free_buffers() > -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html