On 2023/5/3 23:50, Jan Kara wrote:
> On Wed 26-04-23 21:10:41, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> There is a long-standing metadata corruption issue that happens from
>> time to time, but it's very difficult to reproduce and analyse. Thanks
>> to the JBD2_CYCLE_RECORD option, we found out that the problem is that
>> the checkpointing process fails to write out some buffers that are
>> raced by another do_get_write_access(). See below for details.
>>
>> jbd2_log_do_checkpoint() //transaction X
>>  //buffer A is dirty and does not belong to any transaction
>>  __buffer_relink_io() //move it to the IO list
>>  __flush_batch()
>>   write_dirty_buffer()
>>                              do_get_write_access()
>>                              clear_buffer_dirty
>>                              __jbd2_journal_file_buffer()
>>                              //add buffer A to a new transaction Y
>>    lock_buffer(bh)
>>    //doesn't write out
>>  __jbd2_journal_remove_checkpoint()
>>  //finish checkpoint except buffer A
>>  //filesystem corrupt if the new transaction Y isn't fully written out.
>>
>> The fix is subtle because we can't trust the checkpointing buffers and
>> transactions once we release the j_list_lock: they could be written back
>> and checkpointed by someone else, or they could have been added to a new
>> transaction. So we have to re-add them on the checkpoint list and
>> recheck their status if they are clean and don't need to be written out.
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> Tested-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>
>
> Thanks for the analysis. This indeed looks like a nasty issue to debug. I
> think we can actually solve the problem by simplifying the checkpointing
> code in jbd2_log_do_checkpoint(), not by making it more complex. What I
> think we can do is that we can completely remove the t_checkpoint_io_list
> and only keep buffers on t_checkpoint_list. When processing
> t_checkpoint_list in jbd2_log_do_checkpoint(), we just need to make sure to
> move t_checkpoint_list pointer to the next buffer when adding buffer to
> j_chkpt_bhs array. That way buffers to submit / already submitted buffers
> will be accumulating at the tail of the list. The logic in the loop already
> handles waiting for buffers under IO / removing cleaned buffers so this
> makes sure the list will eventually get empty. Buffers cannot get redirtied
> without being removed from the checkpoint list and moved to a newer
> transaction's checkpoint list so forward progress is guaranteed. The only
> other tweak we need to add is to check for the situation when all the
> buffers are in the j_chkpt_bhs array. So the end of the loop should look
> like:
>
> 		transaction->t_checkpoint_list = jh->j_cpnext;
> 		if (batch_count == JBD2_NR_BATCH || need_resched() ||
> 		    spin_needbreak(&journal->j_list_lock) ||
> 		    transaction->t_checkpoint_list == journal->j_chkpt_bhs[0])
> 			flush and restart
>
> and that should be it. What do you think?
>

This solution sounds great. Let me do it.

Thanks,
Yi.
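
For reference, the loop shape described above can be modelled in userspace. This is only a minimal sketch under stated assumptions, not kernel code: struct buf, cp_list, batch, remove_checkpoint(), flush_batch(), do_checkpoint() and run_model() are hypothetical stand-ins for the journal_head, t_checkpoint_list, j_chkpt_bhs and the real jbd2 helpers. Locking, waiting on buffers under IO, and the need_resched() / spin_needbreak() conditions are left out. The wrap-around check is evaluated after every iteration here, which is a choice of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BATCH 4	/* stand-in for JBD2_NR_BATCH; the value is arbitrary */
#define MAX_BUFS 16

/* Hypothetical stand-in for a journal_head on a circular checkpoint list. */
struct buf {
	bool dirty;
	struct buf *next, *prev;
};

static struct buf bufs[MAX_BUFS];
static struct buf *cp_list;		/* models transaction->t_checkpoint_list */
static struct buf *batch[NR_BATCH];	/* models journal->j_chkpt_bhs */
static int batch_count;
static int writes;			/* number of simulated buffer writes */

/* Unlink a clean buffer; the head moves on, or becomes NULL when empty. */
static void remove_checkpoint(struct buf *b)
{
	if (b->next == b) {
		cp_list = NULL;
	} else {
		b->prev->next = b->next;
		b->next->prev = b->prev;
		if (cp_list == b)
			cp_list = b->next;
	}
	b->next = b->prev = NULL;
}

/* "Write out" the batch. Buffers stay on the list and get rechecked later. */
static void flush_batch(void)
{
	for (int i = 0; i < batch_count; i++) {
		batch[i]->dirty = false;
		writes++;
	}
	batch_count = 0;
}

static void do_checkpoint(void)
{
	while (cp_list) {
		struct buf *b = cp_list;

		if (!b->dirty) {
			/* Clean (already written): drop it from the list. */
			remove_checkpoint(b);
		} else {
			/* Queue for writing and step the list head past it. */
			assert(batch_count < NR_BATCH);
			batch[batch_count++] = b;
			cp_list = b->next;
		}
		/* Flush when the batch is full or the head wrapped onto it. */
		if (batch_count == NR_BATCH ||
		    (batch_count > 0 && cp_list == batch[0]))
			flush_batch();
	}
}

/* Build a ring of n dirty buffers, checkpoint it, return the write count. */
static int run_model(int n)
{
	assert(n > 0 && n <= MAX_BUFS);
	for (int i = 0; i < n; i++) {
		bufs[i].dirty = true;
		bufs[i].next = &bufs[(i + 1) % n];
		bufs[i].prev = &bufs[(i + n - 1) % n];
	}
	cp_list = &bufs[0];
	batch_count = 0;
	writes = 0;
	do_checkpoint();
	return writes;
}
```

Calling run_model(n) from a small main() exercises the model. With a single dirty buffer, the head wraps straight back onto batch[0] after one step, which is the "all the buffers are in the j_chkpt_bhs array" case that the extra condition is meant to catch.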