On Wed 26-04-23 21:10:41, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> There is a long-standing metadata corruption issue that happens from
> time to time, but it's very difficult to reproduce and analyse, benefit
> from the JBD2_CYCLE_RECORD option, we found out that the problem is the
> checkpointing process miss to write out some buffers which are raced by
> another do_get_write_access(). Looks below for detail.
>
> jbd2_log_do_checkpoint() //transaction X
>  //buffer A is dirty and not belones to any transaction
>  __buffer_relink_io() //move it to the IO list
>  __flush_batch()
>   write_dirty_buffer()
>                              do_get_write_access()
>                                clear_buffer_dirty
>                                __jbd2_journal_file_buffer()
>                                //add buffer A to a new transaction Y
>    lock_buffer(bh)
>    //doesn't write out
>  __jbd2_journal_remove_checkpoint()
>  //finish checkpoint except buffer A
>  //filesystem corrupt if the new transaction Y isn't fully write out.
>
> The fix is subtle because we can't trust the chechpointing buffers and
> transactions once we release the j_list_lock, they could be written back
> and checkpointed by some others, or they could have been added to a new
> transaction. So we have to re-add them on the checkpoint list and
> recheck their status if they are clean and don't need to write out.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> Tested-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>

Thanks for the analysis. This indeed looks like a nasty issue to debug.
I think we can actually solve the problem by simplifying the checkpointing
code in jbd2_log_do_checkpoint(), not by making it more complex. What I
think we can do is that we can completely remove the t_checkpoint_io_list
and only keep buffers on t_checkpoint_list. When processing
t_checkpoint_list in jbd2_log_do_checkpoint(), we just need to make sure to
move t_checkpoint_list pointer to the next buffer when adding buffer to
j_chkpt_bhs array.
That way buffers to submit / already submitted buffers will be
accumulating at the tail of the list. The logic in the loop already handles
waiting for buffers under IO / removing cleaned buffers so this makes sure
the list will eventually get empty. Buffers cannot get redirtied without
being removed from the checkpoint list and moved to a newer transaction's
checkpoint list so forward progress is guaranteed. The only other tweak we
need to add is to check for the situation when all the buffers are in the
j_chkpt_bhs array. So the end of the loop should look like:

		transaction->t_checkpoint_list = jh->b_cpnext;
		if (batch_count == JBD2_NR_BATCH || need_resched() ||
		    spin_needbreak(&journal->j_list_lock) ||
		    transaction->t_checkpoint_list == journal->j_chkpt_bhs[0])
			flush and restart

and that should be it. What do you think?

								Honza

> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 51bd38da21cd..1aca860eb0f6 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -77,8 +77,31 @@ static inline void __buffer_relink_io(struct journal_head *jh)
>  		jh->b_cpnext->b_cpprev = jh;
>  	}
>  	transaction->t_checkpoint_io_list = jh;
> +	transaction->t_chp_stats.cs_written++;
>  }
>
> +/*
> + * Move a buffer from the checkpoint io list back to the checkpoint list
> + *
> + * Called with j_list_lock held
> + */
> +static inline void __buffer_relink_cp(struct journal_head *jh)
> +{
> +	transaction_t *transaction = jh->b_cp_transaction;
> +
> +	__buffer_unlink(jh);
> +
> +	if (!transaction->t_checkpoint_list) {
> +		jh->b_cpnext = jh->b_cpprev = jh;
> +	} else {
> +		jh->b_cpnext = transaction->t_checkpoint_list;
> +		jh->b_cpprev = transaction->t_checkpoint_list->b_cpprev;
> +		jh->b_cpprev->b_cpnext = jh;
> +		jh->b_cpnext->b_cpprev = jh;
> +	}
> +	transaction->t_checkpoint_list = jh;
> +	transaction->t_chp_stats.cs_written--;
> +}
>  /*
>   * Check a checkpoint buffer could be release or not.
>   *
> @@ -175,8 +198,31 @@ __flush_batch(journal_t *journal, int *batch_count)
>  	struct blk_plug plug;
>
>  	blk_start_plug(&plug);
> -	for (i = 0; i < *batch_count; i++)
> -		write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);
> +	for (i = 0; i < *batch_count; i++) {
> +		struct buffer_head *bh = journal->j_chkpt_bhs[i];
> +		struct journal_head *jh = bh2jh(bh);
> +
> +		lock_buffer(bh);
> +		/*
> +		 * This buffer isn't dirty, it could be getten write access
> +		 * again by a new transaction, re-add it on the checkpoint
> +		 * list if it still needs to be checkpointed, and wait
> +		 * until that transaction finished to write out.
> +		 */
> +		if (!test_clear_buffer_dirty(bh)) {
> +			unlock_buffer(bh);
> +			spin_lock(&journal->j_list_lock);
> +			if (jh->b_cp_transaction)
> +				__buffer_relink_cp(jh);
> +			spin_unlock(&journal->j_list_lock);
> +			jbd2_journal_put_journal_head(jh);
> +			continue;
> +		}
> +		jbd2_journal_put_journal_head(jh);
> +		bh->b_end_io = end_buffer_write_sync;
> +		get_bh(bh);
> +		submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
> +	}
>  	blk_finish_plug(&plug);
>
>  	for (i = 0; i < *batch_count; i++) {
> @@ -303,9 +349,9 @@ int jbd2_log_do_checkpoint(journal_t *journal)
>  			BUFFER_TRACE(bh, "queue");
>  			get_bh(bh);
>  			J_ASSERT_BH(bh, !buffer_jwrite(bh));
> +			jbd2_journal_grab_journal_head(bh);
>  			journal->j_chkpt_bhs[batch_count++] = bh;
>  			__buffer_relink_io(jh);
> -			transaction->t_chp_stats.cs_written++;
>  			if ((batch_count == JBD2_NR_BATCH) ||
>  			    need_resched() ||
>  			    spin_needbreak(&journal->j_list_lock))
> --
> 2.31.1
>
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
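
For illustration only, here is a minimal userspace sketch of the single-list
loop shape suggested in the reply: queued nodes stay on the circular list,
the head pointer moves past them, and a flush is forced once the head wraps
back to the first queued node. Everything in it is an assumption made for
the sketch, not kernel code: struct node and struct txn are hypothetical
stand-ins for struct journal_head and transaction_t, NR_BATCH is an
arbitrary stand-in for JBD2_NR_BATCH, and a "flush" simply marks nodes
clean. There is no locking, no real IO and no concurrent
do_get_write_access(). The wrap check is also evaluated after an iteration
that only drops a clean node, which is a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BATCH 4	/* arbitrary stand-in for JBD2_NR_BATCH */

struct node {		/* hypothetical stand-in for struct journal_head */
	struct node *next, *prev;
	int dirty;
};

struct txn {		/* hypothetical stand-in for transaction_t */
	struct node *list;	/* models t_checkpoint_list (circular) */
};

/* Append n at the tail of the circular list; the head stays unchanged. */
static void list_add_tail(struct txn *t, struct node *n)
{
	if (!t->list) {
		n->next = n->prev = n;
		t->list = n;
	} else {
		n->next = t->list;
		n->prev = t->list->prev;
		n->prev->next = n;
		n->next->prev = n;
	}
}

/* Unlink n, moving the head forward if n was the head. */
static void list_del(struct txn *t, struct node *n)
{
	if (n->next == n) {
		t->list = NULL;
	} else {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		if (t->list == n)
			t->list = n->next;
	}
	n->next = n->prev = NULL;
}

/* Pretend to write the batch out: every queued node becomes clean. */
static void flush_batch(struct node **batch, int *count)
{
	for (int i = 0; i < *count; i++)
		batch[i]->dirty = 0;
	*count = 0;
}

/* Empty the list; returns how many flushes that took. */
static int checkpoint(struct txn *t)
{
	struct node *batch[NR_BATCH];
	int count = 0, flushes = 0;

	while (t->list) {
		struct node *n = t->list;

		if (!n->dirty) {
			list_del(t, n);		/* already written: drop it */
		} else {
			assert(count < NR_BATCH);
			batch[count++] = n;
			t->list = n->next;	/* queued nodes collect at the tail */
		}
		/* Batch full, or the head wrapped to the first queued node. */
		if (count == NR_BATCH || (count && t->list == batch[0])) {
			flush_batch(batch, &count);
			flushes++;
		}
	}
	return flushes;
}
```

With six dirty nodes and NR_BATCH of 4, the first flush is triggered by the
full batch and the second by the head wrapping back to the first queued
node, after which the now-clean nodes are dropped one by one.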