Re: [PATCH] jbd2: recheck checkpointing non-dirty buffer

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Thu, 4 May 2023 19:35:29 +0800

On 2023/5/3 23:50, Jan Kara wrote:
> On Wed 26-04-23 21:10:41, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> There is a long-standing metadata corruption issue that happens from
>> time to time, but it is very difficult to reproduce and analyse. With
>> the help of the JBD2_CYCLE_RECORD option, we found that the
>> checkpointing process fails to write out some buffers when it races
>> with another do_get_write_access(). See below for details.
>>
>> jbd2_log_do_checkpoint() //transaction X
>>  //buffer A is dirty and does not belong to any transaction
>>  __buffer_relink_io() //move it to the IO list
>>  __flush_batch()
>>   write_dirty_buffer()
>>                              do_get_write_access()
>>                              clear_buffer_dirty
>>                              __jbd2_journal_file_buffer()
>>                              //add buffer A to a new transaction Y
>>    lock_buffer(bh)
>>    //doesn't write out
>>  __jbd2_journal_remove_checkpoint()
>>  //finish checkpoint except buffer A
>>  //filesystem is corrupted if the new transaction Y isn't fully written out.
>>
>> The fix is subtle because we can't trust the checkpointing buffers and
>> transactions once we release the j_list_lock: they could be written back
>> and checkpointed by someone else, or they could have been added to a new
>> transaction. So we have to re-add them to the checkpoint list and
>> recheck their status to see whether they are clean and don't need to be
>> written out.
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> Tested-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>
> 
> Thanks for the analysis. This indeed looks like a nasty issue to debug.  I
> think we can actually solve the problem by simplifying the checkpointing
> code in jbd2_log_do_checkpoint(), not by making it more complex. I think
> we can completely remove the t_checkpoint_io_list and keep buffers only
> on t_checkpoint_list. When processing t_checkpoint_list in
> jbd2_log_do_checkpoint(), we just need to make sure to move the
> t_checkpoint_list pointer to the next buffer when adding a buffer to the
> j_chkpt_bhs array. That way buffers to submit / already submitted buffers
> will accumulate at the tail of the list. The logic in the loop already
> handles waiting for buffers under IO / removing cleaned buffers so this
> makes sure the list will eventually get empty. Buffers cannot get redirtied
> without being removed from the checkpoint list and moved to a newer
> transaction's checkpoint list so forward progress is guaranteed. The only
> other tweak we need to add is to check for the situation when all the
> buffers are in the j_chkpt_bhs array. So the end of the loop should look
> like:
> 
> 		transaction->t_checkpoint_list = jh->b_cpnext;
> 		if (batch_count == JBD2_NR_BATCH || need_resched() ||
> 		    spin_needbreak(&journal->j_list_lock) ||
> 		    jh2bh(transaction->t_checkpoint_list) == journal->j_chkpt_bhs[0])
> 			flush and restart
> 
> and that should be it. What do you think?
> 

This solution sounds great. Let me do it.
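
For reference, a minimal userspace sketch of the loop shape suggested
above. It is a toy model, not the jbd2 code: struct toy_jh, struct
toy_txn, toy_add(), toy_remove(), toy_do_checkpoint() and TOY_NR_BATCH
are hypothetical names, the flush is modelled as synchronously marking
buffers clean, and locking, need_resched() and spin_needbreak() are left
out. It only illustrates why advancing the list pointer past batched
buffers, together with the "wrapped around to the first batched buffer"
check, should let the list drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BATCH 4	/* stand-in for JBD2_NR_BATCH; the value is arbitrary */

/* Toy stand-in for a journal head on a circular checkpoint list. */
struct toy_jh {
	struct toy_jh *cpnext, *cpprev;
	bool dirty;	/* still needs to be written out */
	bool in_batch;	/* already queued in the batch array */
};

/* Toy stand-in for a transaction; checkpoint_list models t_checkpoint_list. */
struct toy_txn {
	struct toy_jh *checkpoint_list;
};

/* Append jh at the tail of the circular list (just before the head). */
static void toy_add(struct toy_txn *t, struct toy_jh *jh)
{
	jh->in_batch = false;
	if (!t->checkpoint_list) {
		jh->cpnext = jh->cpprev = jh;
		t->checkpoint_list = jh;
		return;
	}
	jh->cpnext = t->checkpoint_list;
	jh->cpprev = t->checkpoint_list->cpprev;
	jh->cpprev->cpnext = jh;
	jh->cpnext->cpprev = jh;
}

/* Unlink jh; if it was the head, the head moves to the next entry. */
static void toy_remove(struct toy_txn *t, struct toy_jh *jh)
{
	if (jh->cpnext == jh) {
		t->checkpoint_list = NULL;
		return;
	}
	if (t->checkpoint_list == jh)
		t->checkpoint_list = jh->cpnext;
	jh->cpprev->cpnext = jh->cpnext;
	jh->cpnext->cpprev = jh->cpprev;
}

/* Drain the checkpoint list; returns the number of batch flushes. */
static int toy_do_checkpoint(struct toy_txn *t)
{
	struct toy_jh *batch[TOY_NR_BATCH];
	int batch_count = 0, flushes = 0;

	while (t->checkpoint_list) {
		struct toy_jh *jh = t->checkpoint_list;

		if (!jh->dirty) {
			/* Clean buffer: nothing to write, drop it. */
			toy_remove(t, jh);
		} else {
			/* Dirty buffer: queue it and step past it, so queued
			 * buffers pile up behind the list pointer. */
			assert(!jh->in_batch);
			jh->in_batch = true;
			batch[batch_count++] = jh;
			t->checkpoint_list = jh->cpnext;
		}

		/* Flush when the batch is full, or when the list pointer has
		 * wrapped around to the first queued buffer (everything left
		 * on the list is already in the batch). */
		if (batch_count > 0 &&
		    (batch_count == TOY_NR_BATCH ||
		     t->checkpoint_list == batch[0])) {
			for (int i = 0; i < batch_count; i++) {
				batch[i]->dirty = false;	/* "written" */
				batch[i]->in_batch = false;
			}
			batch_count = 0;
			flushes++;
		}
	}
	return flushes;
}
```

In this model the wrap-around check also runs after a clean buffer is
removed, because the removal itself can move the list pointer onto the
first queued buffer. How the real loop should handle that case is not
something the toy settles.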

Thanks,
Yi.