Re: [PATCH] btrfs/213: avoid occasional failure due to already finished balance

Anand Jain <anand.jain@xxxxxxxxxx> · Fri, 19 May 2023 13:04:46 +0800

On 18/5/23 19:08, fdmanana@xxxxxxxxxx wrote:
From: Filipe Manana <fdmanana@xxxxxxxx>

btrfs/213 writes data, in 1M extents, for 4 seconds into a file, then
triggers a balance and then after 2 seconds it tries to cancel the
balance operation. More often than not, this works because the balance
is still running after 2 seconds. However it also fails sporadically
because balance has finished in less than 2 seconds, which is plausible
since data and metadata are cached or other factors such as virtualized
environment. When that's the case, it fails like this:

   $ ./check btrfs/213
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.4.0-rc1-btrfs-next-131+ #1 SMP PREEMPT_DYNAMIC Thu May 11 11:26:19 WEST 2023
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   btrfs/213 51s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad)
       --- tests/btrfs/213.out	2020-06-10 19:29:03.822519250 +0100
       +++ /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad	2023-05-17 15:39:32.653727223 +0100
       @@ -1,2 +1,3 @@
        QA output created by 213
       +ERROR: balance cancel on '/home/fdmanana/btrfs-tests/scratch_1' failed: Not in progress
        Silence is golden
       ...
       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/213.out /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad'  to see the entire diff)
   Ran: btrfs/213
   Failures: btrfs/213
   Failed 1 of 1 tests

To make it much less likely that balance has already finished before we
try to cancel it, unmount and mount again the filesystem before starting
balance, to clear cached metadata and data, and also double the time we
spend writing 1M data extents. Also ignore when the balance failed because
it was already finished when we tried to cancel it.

Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
---
  tests/btrfs/213 | 13 ++++++++++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tests/btrfs/213 b/tests/btrfs/213
index 8a10355c..cca0b3cc 100755
--- a/tests/btrfs/213
+++ b/tests/btrfs/213
@@ -28,7 +28,7 @@ _require_xfs_io_command pwrite -D
  _scratch_mkfs >> $seqres.full
  _scratch_mount
  
-runtime=4
+runtime=8
  
  # Create enough IO so that we need around $runtime seconds to relocate it.
  #
@@ -39,11 +39,18 @@ sleep $runtime
  kill $write_pid
  wait $write_pid
  
+# Unmount and mount again the fs to clear any cached data and metadata, so that
+# it's less likely balance has already finished when we try to cancel it below.
+_scratch_cycle_mount
+
  # Now balance should take at least $runtime seconds, we can cancel it at
  # $runtime/2 to ensure a success cancel.
  _run_btrfs_balance_start -d --bg "$SCRATCH_MNT"


-sleep $(($runtime / 2))
-$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT"
+sleep $(($runtime / 4))
+# It's possible that balance has already completed. It's unlikely but often
+# it may happen due to virtualization, caching and other factors, so ignore
+# any error about no balance currently running.
+$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" 2>&1 | grep -iv 'not in progress'

Cancel is an important step in this test case.
Why not call _notrun() if the test case fails to make sure
the balance is still in progress? This way, it provides
another opportunity to fix.

Thanks, Anand

  
  # Now check if we can finish relocating metadata, which should finish very
  # quickly.