Thanks a lot for your in-depth explanation. I checked the code and get it now.
I will modify the patch as you suggested and post v2 after testing.

Thanks,
Yi.

On 2019/1/11 18:30, Jan Kara wrote:
> On Fri 11-01-19 14:11:31, zhangyi (F) wrote:
>> On 2019/1/10 19:20, Jan Kara wrote:
>>> On Thu 10-01-19 14:12:02, zhangyi (F) wrote:
>>>> Now, we capture a data corruption problem on ext4 while we're truncating
>>>> an extent index block. Imagine that we are revoking a buffer which
>>>> has been journaled by the committing transaction: the buffer's jbddirty
>>>> flag will not be cleared in jbd2_journal_forget(), so the commit code
>>>> will set the buffer dirty flag again after refiling the buffer.
>>>>
>>>> fsx                               kjournald2
>>>>                                   jbd2_journal_commit_transaction
>>>> jbd2_journal_revoke                commit phase 1~5...
>>>>  jbd2_journal_forget
>>>>    belongs to older transaction    commit phase 6
>>>>    jbddirty not clear               __jbd2_journal_refile_buffer
>>>>                                      __jbd2_journal_unfile_buffer
>>>>                                       test_clear_buffer_jbddirty
>>>>                                        mark_buffer_dirty
>>>>
>>>> Finally, if the freed extent index block was allocated again as a data
>>>> block by some other file, it may corrupt the file data when writing
>>>> cached pages later, such as at umount time.
>>>>
>>>> This patch marks the buffer as freed when it already belongs to the
>>>> committing transaction in jbd2_journal_forget(), so that the commit code
>>>> knows it should clear dirty bits when it is done with the buffer.
>>>>
>>>> This problem can be reproduced by xfstests generic/455 easily with
>>>> seeds (3246 3247 3248 3249).
>>>>
>>>> Signed-off-by: zhangyi (F) <yi.zhang@xxxxxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>
>>> Thanks a lot for the analysis and the patch! I fully agree with your
>>> analysis; however, I think just setting the buffer as freed isn't completely
>>> correct. The problem is the following: the metadata buffer X has been
>>> modified by the committing transaction - let's call it A. It has been
>>> freed in the currently running transaction B.
>>> Now jbd2_journal_forget() clears b_next_transaction, and if you set the
>>> buffer freed flag, X will not be added to the checkpoint list. So when
>>> transaction A finishes commit, it can get checkpointed (without writing
>>> out X) before transaction B commits. So if a crash occurs before B
>>> commits, we'd lose the modification of X from transaction A and thus
>>> cause filesystem corruption.
>>>
>> Thanks for your explanation! There are still two points I don't quite
>> understand.
>>
>> I checked all three cases of doing checkpoint. IIUC, both
>> jbd2_journal_destroy() and jbd2_journal_flush() wait for the currently
>> running transaction B to complete before doing checkpoint; only
>> __jbd2_log_wait_for_space() does not. So I guess this is the case you
>> mentioned, where transaction A could be checkpointed before B commits,
>> am I right?
>
> Yes, __jbd2_log_wait_for_space() can checkpoint already committed
> transactions (i.e., A in our case) without waiting for the running
> transaction (B in our case).
>
>> For another case, jbd2_update_log_tail() will be invoked after transaction B
>> completes, so the problem above also can't happen here, right?
>
> I'm not sure which "another case" you speak about here...
>
>>> What rather needs to happen is the same thing that is done in
>>> journal_unmap_buffer() in this case: We set the buffer freed flag and we
>>> also set b_next_transaction to the currently running transaction (B).
>>> This will prevent A from being checkpointed before B commits and thus
>>> avoids the problem above.
>>>
>> Sorry, I don't get this point. I find that the only difference between
>> setting b_next_transaction or not is whether buffer X is re-added to the
>> BJ_Reserved list. How could that avoid the problem above?
>
> Currently, X will be removed from transaction B by jbd2_journal_revoke().
> So once A commits, it will not be in the running transaction and thus
> checkpoint of A can complete before B is committed.
>
> If we set X->b_next_transaction to B, X will be part of transaction B. The
> handling of a buffer_freed() buffer in the commit code thus will not clear
> the jbddirty bit and X will get inserted in A as a buffer for checkpointing.
> And thus checkpoint of A will not be able to complete before B commits,
> fixing the problem I have described.
>
>> BTW, I am thinking of a similar case. If we modify buffer X instead of
>> revoking it in transaction B, we also need to prevent transaction A from
>> being checkpointed before B commits, because the current buffer X contains
>> the modified data (modified by B). So we should prevent writing it before
>> B commits, otherwise it will corrupt metadata. How do we handle this
>> situation now?
>
> Buffers that are part of the running transaction never have the
> buffer_dirty bit set (look how jbd2_journal_file_buffer() clears this bit).
> Thus background writeback will not write these buffers. Also, the
> checkpointing code checks whether the buffer is part of the running /
> committing transaction and handles these buffers specially, exactly
> because they cannot be written out directly.
>
> 								Honza
>
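
The checkpoint-ordering argument in the thread can be sketched as a small userspace C model. This is only a toy under stated assumptions, not jbd2 code: `toy_txn`, `toy_buf` and every `toy_*` function are hypothetical names invented for illustration, and the fields only loosely mirror the roles of b_transaction, b_next_transaction and the checkpoint list discussed above. It compares the two variants of forgetting X while it still belongs to the committing transaction A: marking it freed only, versus also pointing it at the running transaction B.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; NOT the kernel's transaction_t or journal_head. */
struct toy_txn {
    bool committed;
};

struct toy_buf {
    bool jbddirty;                    /* journaled, not yet written in place */
    bool freed;                       /* forgotten by a later transaction */
    struct toy_txn *transaction;      /* transaction that owns the buffer now */
    struct toy_txn *next_transaction; /* transaction that takes over at commit */
    struct toy_txn *cp_transaction;   /* transaction that must checkpoint it */
};

/* Forget X in the running transaction while the committing one still owns it.
 * set_next selects the variant suggested in the review (assumption of this
 * model: it is the only difference between the two approaches). */
void toy_forget(struct toy_buf *x, struct toy_txn *running, bool set_next)
{
    assert(x->transaction != running);
    x->freed = true;
    x->next_transaction = set_next ? running : NULL;
}

/* End of commit for the transaction that currently owns X. */
void toy_commit(struct toy_buf *x)
{
    struct toy_txn *committing = x->transaction;

    if (x->freed && x->next_transaction == NULL) {
        /* No later owner: drop the dirty state, so no checkpoint entry. */
        x->freed = false;
        x->jbddirty = false;
    }
    if (x->jbddirty)
        x->cp_transaction = committing;

    x->transaction = x->next_transaction;
    x->next_transaction = NULL;
    committing->committed = true;
}

/* Can txn be fully checkpointed right now? */
bool toy_can_checkpoint(const struct toy_txn *txn, const struct toy_buf *x)
{
    if (!txn->committed)
        return false;
    if (x->cp_transaction != txn)
        return true;  /* nothing left to write on behalf of txn */
    /* X cannot be written out while an uncommitted transaction owns it. */
    return x->transaction == NULL || x->transaction->committed;
}

/* Scenario from the thread: A modified X, B forgets X, A commits, B runs on. */
bool toy_a_checkpointable_before_b(bool set_next)
{
    struct toy_txn a = { .committed = false };
    struct toy_txn b = { .committed = false };
    struct toy_buf x = { .jbddirty = true, .transaction = &a };

    toy_forget(&x, &b, set_next);
    toy_commit(&x);
    return toy_can_checkpoint(&a, &x);
}
```

In this model, `toy_a_checkpointable_before_b(false)` returns true, which is the hazard: A can be checkpointed without X ever being written, so a crash before B commits would lose A's update. `toy_a_checkpointable_before_b(true)` returns false, because X stays on A's checkpoint list while the uncommitted B owns it.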