On 2023/6/1 21:44, Zhihao Cheng wrote: > 在 2023/6/1 17:41, Jan Kara 写道: > > Hi, Jan >> On Wed 31-05-23 19:50:59, Zhang Yi wrote: >>> From: Zhihao Cheng <chengzhihao1@xxxxxxxxxx> >>> >>> Following process, >>> >>> jbd2_journal_commit_transaction >>> // there are several dirty buffer heads in transaction->t_checkpoint_list >>> P1 wb_workfn >>> jbd2_log_do_checkpoint >>> if (buffer_locked(bh)) // false >>> __block_write_full_page >>> trylock_buffer(bh) >>> test_clear_buffer_dirty(bh) >>> if (!buffer_dirty(bh)) >>> __jbd2_journal_remove_checkpoint(jh) >>> if (buffer_write_io_error(bh)) // false >>> >> bh IO error occurs << >>> jbd2_cleanup_journal_tail >>> __jbd2_update_log_tail >>> jbd2_write_superblock >>> // The bh won't be replayed in next mount. >>> , which could corrupt the ext4 image, fetch a reproducer in [Link]. >>> >>> Since writeback process clears buffer dirty after locking buffer head, >>> we can fix it by checking buffer dirty firstly and then checking buffer >>> locked, the buffer head can be removed if it is neither dirty nor locked. >>> >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490 >>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd") >>> Signed-off-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx> >>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> >> >> OK, the analysis is correct but I'm afraid the fix won't be that easy. The >> reordering of tests you did below doesn't really help because CPU or the >> compiler are free to order the loads (and stores) in whatever way they >> wish. You'd have to use memory barriers when reading and modifying bh flags >> (although the modification side is implicitely handled by the bitlock >> code) to make this work reliably. But that is IMHO too subtle for this >> code. >> > I write two litmus-test files in tools/memory-model to check the memory module of these two cases as jbd2_log_do_checkpoint() and __cp_buffer_busy() does. **1. test-ww-rr.litmus** //simulate our changes in jbd2_log_do_checkpoint() C WW-RR (* * Result: Never * * This test shows that write-write ordering * (in P0()) is visible to external process P2(). * * bit 0: lock * bit 1: dirty *) { x=2; } P0(int *x) { int r1; r1 = READ_ONCE(*x); r1 = r1 | 1; //lock WRITE_ONCE(*x, r1); r1 = READ_ONCE(*x); r1 = r1 & 253; //&~2, clear dirty WRITE_ONCE(*x, r1); } P1(int *x) { int r0; int r1; r0 = READ_ONCE(*x); r1 = READ_ONCE(*x); } exists (1:r1=2 /\ 1:r0=1) The test results are: $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-rr.litmus Test WW-RR Allowed States 6 1:r0=1; 1:r1=1; 1:r0=2; 1:r1=1; 1:r0=2; 1:r1=2; 1:r0=2; 1:r1=3; 1:r0=3; 1:r1=1; 1:r0=3; 1:r1=3; No Witnesses Positive: 0 Negative: 6 Condition exists (1:r1=2 /\ 1:r0=1) Observation WW-RR Never 0 6 Time WW-RR 0.02 Hash=d91ecee2379f8ac878b8d06f17967874 **2. test-ww-r.litmus** //simulate our changes in __cp_buffer_busy() C WW-R (* * Result: Never * * This test shows that write-write ordering * (in P0()) is visible to external process P2(). * * bit 0: lock * bit 1: dirty *) { x=2; } P0(int *x) { int r1; r1 = READ_ONCE(*x); r1 = r1 | 1; //lock WRITE_ONCE(*x, r1); r1 = READ_ONCE(*x); r1 = r1 & 253; //&~2, clear dirty WRITE_ONCE(*x, r1); } P1(int *x) { int r0; r0 = READ_ONCE(*x); } exists (1:r0=0) The test results are: $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-r.litmus Test WW-R Allowed States 3 1:r0=1; 1:r0=2; 1:r0=3; No Witnesses Positive: 0 Negative: 3 Condition exists (1:r0=0) Observation WW-R Never 0 3 Time WW-R 0.01 Hash=d76bb39c367dc0cbd0c363a87c6c9eb7 So it looks like the out-of-order situation cannot happen, am I miss something? Thanks, Yi. > Do you mean there might be a sequence like following: > > jbd2_log_do_checkpoint > if (buffer_dirty(bh)) > else if (buffer_locked(bh)) > else > __jbd2_journal_remove_checkpoint(jh) > > CPU re-arranges the order of getting buffer state. > reg_1 = buffer_locked(bh) // false > lock_buffer(bh) > clear_buffer(bh) > reg_2 = buffer_dirty(bh) // false > > Then, jbd2_log_do_checkpoint() could become: > if (reg_2) > else if (reg_1) > else > __jbd2_journal_remove_checkpoint(jh) // enter ! > > Am I understanding right? > >> What we should be doing to avoid these races is to lock the bh. So >> something like: >> >> if (jh->b_transaction != NULL) { >> do stuff >> } >> if (!trylock_buffer(bh)) { >> buffer_locked() branch >> } >> ... Now we have the buffer locked and can safely check for dirtyness >> >> And we need to do a similar treatment for journal_clean_one_cp_list() and >> journal_shrink_one_cp_list(). >> >> BTW, I think we could merge journal_clean_one_cp_list() and >> journal_shrink_one_cp_list() into a single common function. I think we can >> drop the nr_to_scan argument and just always cleanup the whole checkpoint >> list and return the number of freed buffers. That way we have one less >> function to deal with checkpoint list cleaning. >> >> Thinking about it some more maybe we can have a function like: >> >> int jbd2_try_remove_checkpoint(struct journal_head *jh) >> { >> struct buffer_head *bh = jh2bh(jh); >> >> if (!trylock_buffer(bh) || buffer_dirty(bh)) >> return -EBUSY; >> /* >> * Buffer is clean and the IO has finished (we hold the buffer lock) so >> * the checkpoint is done. We can safely remove the buffer from this >> * transaction. >> */ >> unlock_buffer(bh); >> return __jbd2_journal_remove_checkpoint(jh); >> } >> >> and that can be used with a bit of care in the checkpointing functions as >> well as in jbd2_journal_forget(), __journal_try_to_free_buffer(), >> journal_unmap_buffer(). >> >> Honza >>