Re: [PATCH RFC] common/btrfs: use _scratch_cycle_mount to ensure all page caches are dropped

Filipe Manana <fdmanana@xxxxxxxxxx> · Tue, 20 Jun 2023 10:13:00 +0100

On Tue, Jun 20, 2023 at 2:07 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
>
>
> On 2023/6/20 00:13, Filipe Manana wrote:
> > On Thu, Jun 8, 2023 at 1:02 PM Qu Wenruo <wqu@xxxxxxxx> wrote:
> >>
> >> [BUG]
> >> There is a chance that btrfs/266 would fail on aarch64 with 64K page
> >> size. (No matter if it's 4K sector size or 64K sector size)
> >>
> >> The failure indicates that one or more mirrors are not properly fixed.
> >>
> >> [CAUSE]
> >> I have added some trace events into the btrfs IO paths, including
> >> __btrfs_submit_bio() and __end_bio_extent_readpage().
> >>
> >> When the test failed, the trace looks like this:
> >>
> >>   112819.764977: __btrfs_submit_bio: r/i=5/257 fileoff=0 mirror=1 len=196608 pid=33663
> >>                                      ^^^ Initial read on the full 192K file
> >>   112819.766604: __btrfs_submit_bio: r/i=5/257 fileoff=0 mirror=2 len=65536 pid=21806
> >>                                      ^^^ Repair on the first 64K block
> >>                                          Which would success
> >>   112819.766760: __btrfs_submit_bio: r/i=5/257 fileoff=65536 mirror=2 len=65536 pid=21806
> >>                                      ^^^ Repair on the second 64K block
> >>                                          Which would fail
> >>   112819.767625: __btrfs_submit_bio: r/i=5/257 fileoff=65536 mirror=3 len=65536 pid=21806
> >>                                      ^^^ Repair on the third 64K block
> >>                                          Which would success
> >>   112819.770999: end_bio_extent_readpage: read finished, r/i=5/257 fileoff=0 len=196608 mirror=1 status=0
> >>                                           ^^^ The repair succeeded, the
> >>                                               full 192K read finished.
> >>
> >>   112819.864851: __btrfs_submit_bio: r/i=5/257 fileoff=0 mirror=3 len=196608 pid=33665
> >>   112819.874854: __btrfs_submit_bio: r/i=5/257 fileoff=0 mirror=1 len=65536 pid=31369
> >>   112819.875029: __btrfs_submit_bio: r/i=5/257 fileoff=131072 mirror=1 len=65536 pid=31369
> >>   112819.876926: end_bio_extent_readpage: read finished, r/i=5/257 fileoff=0 len=196608 mirror=3 status=0
> >>
> >> But above read only happen for mirror 1 and mirror 3, mirror 2 is not
> >> involved.
> >> This means by somehow, the read on mirror 2 didn't happen, mostly
> >> due to something wrong during the drop_caches call.
> >>
> >> It may be an async operation or some hardware specific behavior.
> >>
> >> On the other hand, for test cases doing the same operation but utilizing
> >> direct IO, aka btrfs/267, it never fails as we never populate the page
> >> cache thus would always read from the disk.
> >>
> >> [WORKAROUND]
> >> The root cause is in the "echo 3 > drop_caches", which I'm still
> >> debugging.
> >>
> >> But at the same time, we can workaround the problem by forcing a
> >> cycle mount of scratch device, inside _btrfs_buffered_read_on_mirror().
> >>
> >> By this we can ensure all page caches are dropped no matter if
> >> drop_caches is working correctly.
> >>
> >> With this patch, I no longer hit the failure on aarch64 with 64K page
> >> size anymore, while before this the test case would always fail during a
> >> -I 10 run.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> >> ---
> >> RFC:
> >> The root cause is still inside that "echo 3 > drop_caches", but I don't
> >> have any better solution if such fundamental debug function is not
> >> working as expected.
> >>
> >> Thus this is only a workaround, before I can pin down the root cause of
> >> that drop_cache hiccup.
> >> ---
> >>   common/btrfs | 5 +++++
> >>   1 file changed, 5 insertions(+)
> >>
> >> diff --git a/common/btrfs b/common/btrfs
> >> index 344509ce..1d522c33 100644
> >> --- a/common/btrfs
> >> +++ b/common/btrfs
> >> @@ -599,6 +599,11 @@ _btrfs_buffered_read_on_mirror()
> >>          local size=$5
> >>
> >>          echo 3 > /proc/sys/vm/drop_caches
> >> +       # Above drop_caches doesn't seem to drop every pages on aarch64 with
> >> +       # 64K page size.
> >> +       # So here as a workaround, cycle mount the SCRATCH_MNT to ensure
> >> +       # the cache are dropped.
> >> +       _scratch_cycle_mount
> >
> > Btw, I'm getting some tests failing because of this change.
> >
> > For e.g.:
> >
> > ./check btrfs/143
> > FSTYP         -- btrfs
> > PLATFORM      -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP
> > PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
> > MKFS_OPTIONS  -- /dev/sdc
> > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
> >
> > btrfs/143 6s ... [failed, exit status 1]- output mismatch (see
> > /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad)
> >      --- tests/btrfs/143.out 2020-06-10 19:29:03.818519162 +0100
> >      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad
> > 2023-06-19 17:04:00.575033899 +0100
> >      @@ -1,37 +1,6 @@
> >       QA output created by 143
> >       wrote 131072/131072 bytes
> >       XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >      -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > ................
> >      -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > ................
> >      -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > ................
> >      -XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > ................
> >      ...
> >      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/143.out
> > /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad'  to see
> > the entire diff)
> > Ran: btrfs/143
> > Failures: btrfs/143
> > Failed 1 of 1 tests
> >
> > The problem is this test is using dm-rust, and _scratch_cycle_mount()
> > doesn't work in that case.
> > It should be _unmount_dust() followed by_mount_dust() for this particular case.
>
> Any idea on a better solution?
>
> My current idea is to grab the mounted device of SCRATCH_MNT, then
> unmount and mount the grabbed device, instead of always using scratch
> device.

Maybe that, yes.

>
> But it may look a little over-complicated and would affect too many test
> cases.
>
> Or maybe we can detect if it's using dust device inside
> _btrfs_buffered_read_on_mirror() instead?

Well, that sounds fragile. Maybe there are other tests using dmflakey
instead of dmrust, or some other target.
Or maybe not now, but in the future...

>
> Thanks,
> Qu
> >
> >
> >>          while [[ -z $( (( BASHPID % nr_mirrors == mirror )) &&
> >>                  exec $XFS_IO_PROG \
> >>                          -c "pread -b $size $offset $size" $file) ]]; do
> >> --
> >> 2.39.1
> >>