Re: [PATCH 4/5] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Thu, 1 Jun 2023 22:20:38 +0800

On 2023/6/1 21:44, Zhihao Cheng wrote:
> 在 2023/6/1 17:41, Jan Kara 写道:
> 
> Hi, Jan
>> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>>> From: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>
>>>
>>> Following process,
>>>
>>> jbd2_journal_commit_transaction
>>> // there are several dirty buffer heads in transaction->t_checkpoint_list
>>>            P1                   wb_workfn
>>> jbd2_log_do_checkpoint
>>>   if (buffer_locked(bh)) // false
>>>                              __block_write_full_page
>>>                               trylock_buffer(bh)
>>>                               test_clear_buffer_dirty(bh)
>>>   if (!buffer_dirty(bh))
>>>    __jbd2_journal_remove_checkpoint(jh)
>>>     if (buffer_write_io_error(bh)) // false
>>>                               >> bh IO error occurs <<
>>>   jbd2_cleanup_journal_tail
>>>    __jbd2_update_log_tail
>>>     jbd2_write_superblock
>>>     // The bh won't be replayed in next mount.
>>> , which could corrupt the ext4 image, fetch a reproducer in [Link].
>>>
>>> Since writeback process clears buffer dirty after locking buffer head,
>>> we can fix it by checking buffer dirty firstly and then checking buffer
>>> locked, the buffer head can be removed if it is neither dirty nor locked.
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
>>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
>>> Signed-off-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>
>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
>> reordering of tests you did below doesn't really help because CPU or the
>> compiler are free to order the loads (and stores) in whatever way they
>> wish. You'd have to use memory barriers when reading and modifying bh flags
>> (although the modification side is implicitely handled by the bitlock
>> code) to make this work reliably. But that is IMHO too subtle for this
>> code.
>>
> 

I write two litmus-test files in tools/memory-model to check the memory module
of these two cases as jbd2_log_do_checkpoint() and __cp_buffer_busy() does.

**1. test-ww-rr.litmus**  //simulate our changes in jbd2_log_do_checkpoint()

  C WW-RR

  (*
   * Result: Never
   *
   * This test shows that write-write ordering
   * (in P0()) is visible to external process P2().
   *
   * bit 0: lock
   * bit 1: dirty
   *)

  {
          x=2;
  }

  P0(int *x)
  {
          int r1;

          r1 = READ_ONCE(*x);
          r1 = r1 | 1;  //lock
          WRITE_ONCE(*x, r1);

          r1 = READ_ONCE(*x);
          r1 = r1 & 253;  //&~2, clear dirty
          WRITE_ONCE(*x, r1);
  }

  P1(int *x)
  {
          int r0;
          int r1;

          r0 = READ_ONCE(*x);
          r1 = READ_ONCE(*x);
  }

  exists (1:r1=2 /\ 1:r0=1)

The test results are:

  $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-rr.litmus
  Test WW-RR Allowed
  States 6
  1:r0=1; 1:r1=1;
  1:r0=2; 1:r1=1;
  1:r0=2; 1:r1=2;
  1:r0=2; 1:r1=3;
  1:r0=3; 1:r1=1;
  1:r0=3; 1:r1=3;
  No
  Witnesses
  Positive: 0 Negative: 6
  Condition exists (1:r1=2 /\ 1:r0=1)
  Observation WW-RR Never 0 6
  Time WW-RR 0.02
  Hash=d91ecee2379f8ac878b8d06f17967874

**2. test-ww-r.litmus**  //simulate our changes in __cp_buffer_busy()

  C WW-R

  (*
   * Result: Never
   *
   * This test shows that write-write ordering
   * (in P0()) is visible to external process P2().
   *
   * bit 0: lock
   * bit 1: dirty
   *)

  {
          x=2;
  }

  P0(int *x)
  {
          int r1;

          r1 = READ_ONCE(*x);
          r1 = r1 | 1;  //lock
          WRITE_ONCE(*x, r1);

          r1 = READ_ONCE(*x);
          r1 = r1 & 253;  //&~2, clear dirty
          WRITE_ONCE(*x, r1);
  }

  P1(int *x)
  {
          int r0;

          r0 = READ_ONCE(*x);
  }

  exists (1:r0=0)

The test results are:

  $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-r.litmus
  Test WW-R Allowed
  States 3
  1:r0=1;
  1:r0=2;
  1:r0=3;
  No
  Witnesses
  Positive: 0 Negative: 3
  Condition exists (1:r0=0)
  Observation WW-R Never 0 3
  Time WW-R 0.01
  Hash=d76bb39c367dc0cbd0c363a87c6c9eb7

So it looks like the out-of-order situation cannot happen, am I miss something?

Thanks,
Yi.

> Do you mean there might be a sequence like following:
> 
> jbd2_log_do_checkpoint
>  if (buffer_dirty(bh))
>  else if (buffer_locked(bh))
>  else
>    __jbd2_journal_remove_checkpoint(jh)
> 
> CPU re-arranges the order of getting buffer state.
> reg_1 = buffer_locked(bh)  // false
>                            lock_buffer(bh)
>                            clear_buffer(bh)
> reg_2 = buffer_dirty(bh)   // false
> 
> Then, jbd2_log_do_checkpoint() could become:
> if (reg_2)
> else if (reg_1)
> else
>   __jbd2_journal_remove_checkpoint(jh)  // enter !
> 
> Am I understanding right?
> 
>> What we should be doing to avoid these races is to lock the bh. So
>> something like:
>>
>>     if (jh->b_transaction != NULL) {
>>         do stuff
>>     }
>>     if (!trylock_buffer(bh)) {
>>         buffer_locked() branch
>>     }
>>     ... Now we have the buffer locked and can safely check for dirtyness
>>
>> And we need to do a similar treatment for journal_clean_one_cp_list() and
>> journal_shrink_one_cp_list().
>>
>> BTW, I think we could merge journal_clean_one_cp_list() and
>> journal_shrink_one_cp_list() into a single common function. I think we can
>> drop the nr_to_scan argument and just always cleanup the whole checkpoint
>> list and return the number of freed buffers. That way we have one less
>> function to deal with checkpoint list cleaning.
>>
>> Thinking about it some more maybe we can have a function like:
>>
>> int jbd2_try_remove_checkpoint(struct journal_head *jh)
>> {
>>     struct buffer_head *bh = jh2bh(jh);
>>
>>     if (!trylock_buffer(bh) || buffer_dirty(bh))
>>         return -EBUSY;
>>     /*
>>      * Buffer is clean and the IO has finished (we hold the buffer lock) so
>>      * the checkpoint is done. We can safely remove the buffer from this
>>      * transaction.
>>      */
>>     unlock_buffer(bh);
>>     return __jbd2_journal_remove_checkpoint(jh);
>> }
>>
>> and that can be used with a bit of care in the checkpointing functions as
>> well as in jbd2_journal_forget(), __journal_try_to_free_buffer(),
>> journal_unmap_buffer().
>>
>>                                 Honza
>>