Re: [PATCH 02/40] fstests: cleanup fsstress process management

Zorro Lang <zlang@xxxxxxxxxx> · Fri, 29 Nov 2024 12:03:54 +0800

On Wed, Nov 27, 2024 at 03:51:32PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Lots of tests run fsstress in the background and then have to kill
> it and/or provide special cleanup functions to kill the background
> fsstress processes. They typically use $KILLALL_PROG for this.
> 
> Use of killall is problematic for running multiple tests in parallel
> in that one test can kill other tests' processes.  However, because
> fsstress itself forks and runs children, there are very few avenues
> for shell scripts to ensure all the fsstress processes actually die.
> 
> With bash, it is especially nasty, because sending SIGTERM will
> result in bash outputting error messages ("Killed: ..." that will
> cause golden output mismatches and hence test failures. Hence we
> also need to be able to tell the main fstress process to die without
> triggering these messages.
> 
> To avoid the process tracking problems, we change to use pkill
> rather than killall (more options for process selection) and we
> stop using the $here/ltp/fsstress binary. Instead, we copy the
> $here/ltp/fsstress to $TEST_DIR/$seq.fsstress so that the test has
> a unique fsstress binary name. This allows the pkill filter to
> select just the fsstress processes the test has run. The fsstress
> binary name is held in _FSSTRESS_NAME, and the program to run is
> _FSSTRESS_PROG.
> 
> We also track the primary fsstress process ID, and store that in
> _FSSTRESS_PID. We do this so that we have a PID to wait against so
> that we don't return before the fsstress processes are dead. To this
> end, we add a SIGPIPE handler to the primary process so that it
> dying doesn't trigger bash 'killed' message output. We can
> send 'pkill -PIPE $_FSSTRESS_NAME' to all the fsstress processes and
> the primary process will then enter the "wait for children to die"
> processing loop before it exits. In this way, we can wait for the
> primary fsstress process and when it exits we know that all it's
> children have also finished and gone away. This makes killing
> fsstress invocations reliable and noise free.
> 
> This is accomplished by the helpers added to common/rc:
> 
> 	_run_fsstress
> 	_run_fsstress_bg
> 	_wait_for_fsstress
> 	_kill_fstress
> 
> This also means that all fsstress invocations now obey
> FSSTRESS_AVOID environment restrictions, many of which didn't.
> 
> We add a call to _kill_fstress into the generic _cleanup() function.
> This means that tests using fsstress don't need to add a special
> local _cleanup function just to call _kill_fsstress() so that
> background fsstress processes are killed when the user interrupts
> the tests with ctrl-c.
> 
> Further, killall in the _cleanup() function is often used to attempt
> to expedite killing of foreground execution fsstress processes. This
> doesn't actually work because of the way bash processes interupt
> signals. That is, it waits for the currently executing process to
> finish execution, then runs the trap function. Hence a foreground
> fsstress won't ever be interrupted by ctrl-c. By implementing
> _run_fsstress() as a background process and a wait call, the wait()
> call is interrupted by the signal and the cleanup trap is run
> immediately. Hence the fsstress processes are killed immediately and
> the test exits cleanly almost immediately.
> 
> The result of all this is common, clean handling of fsstress
> execution and termination. There are a few exceptions for special
> cases, but the vast majority of tests that run fsstress use the
> above four wrapper functions exclusively.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

[snip]

> diff --git a/tests/btrfs/028 b/tests/btrfs/028
> index f64fc831d..85e42f31e 100755
> --- a/tests/btrfs/028
> +++ b/tests/btrfs/028
> @@ -32,8 +32,7 @@ args=`_scale_fsstress_args -z \
>  	-f fsync=10 -n 100000 -p 2 \
>  	-d $SCRATCH_MNT/stress_dir`
>  echo "Run fsstress $args" >>$seqres.full
> -$FSSTRESS_PROG $args >>$seqres.full &
> -fsstress_pid=$!
> +_run_fsstress_bg $args
>  
>  echo "Start balance" >>$seqres.full
>  _btrfs_stress_balance -d $SCRATCH_MNT >/dev/null 2>&1 &
> @@ -41,8 +40,7 @@ balance_pid=$!
>  
>  # 30s is enough to trigger bug
>  sleep $((30*$TIME_FACTOR))
> -kill $fsstress_pid &> /dev/null
> -wait $fsstress_pid &> /dev/null
> +_kill_fsstress
>  _btrfs_kill_stress_balance_pid $balance_pid

I'll fix the conflict of this patch with:
https://lore.kernel.org/fstests/20241129031846.rije5u7o5o64v6gv@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m8174317696134bdc64d27c2f3012f7f843135f91

by below change [1], if you have more suggestions, feel free to tell me.

Thanks,
Zorro

[1]

diff --git a/tests/btrfs/028 b/tests/btrfs/028
index 4c6396ddb..b41e1077a 100755
--- a/tests/btrfs/028
+++ b/tests/btrfs/028
@@ -20,10 +20,7 @@ _cleanup()
        if [ ! -z "$balance_pid" ]; then
                _btrfs_kill_stress_balance_pid $balance_pid
        fi
-       if [ ! -z "$fsstress_pid" ]; then
-               kill $fsstress_pid &> /dev/null
-               wait $fsstress_pid &> /dev/null
-       fi
+       _kill_fsstress
 }
 
 . ./common/filter
@@ -55,7 +52,7 @@ balance_pid=$!
 sleep $((30*$TIME_FACTOR))
 _kill_fsstress
 _btrfs_kill_stress_balance_pid $balance_pid
-unset fsstress_pid balance_pid
+unset balance_pid