From: Dave Chinner <dchinner@xxxxxxxxxx>

Runs tests in parallel runner threads. Each runner thread has its
own set of tests to run, and runs a separate instance of check to
run those tests. check-parallel sets up loop devices, mount points,
results directories, etc. for each instance and divides the tests up
between the runner threads.

It currently hard codes the XFS and generic test lists, and then
gives each check invocation an explicit list of tests to run. It
also passes through exclusions so that test exclude filtering is
still done by check.

This is far from ideal, but I didn't want to have to embark on a
major refactoring of check to be able to run stuff in parallel. It
was quite the challenge just to get all the tests and test
infrastructure up to the point where they can run reliably in
parallel.

Hence I've left the actual factoring of test selection and setup
out of the patchset for the moment. The plan is to factor both the
test setup and the test list runner loop out of check and share them
between check and check-parallel, hence not requiring check-parallel
to run check directly. That is future work, however.

With the current test runner setup, it is not uncommon to see
>5000% CPU usage, 150-200kiops and 4-5GB/s of disk bandwidth being
used when running 64 runners. This is a serious stress load as it
is constantly mounting and unmounting dozens of filesystems,
creating and destroying devices, dropping caches, running sync,
running CPU hot plug, running page cache migration, etc.

The massive amount of IO that load generates causes qemu hosts to
abort (i.e. crash) because they run out of vm map segments. Hence
bumping up the max_map_count on the host like so:

  echo 1048576 > /proc/sys/vm/max_map_count

is necessary.

There is no significant memory pressure to speak of from running
the tests like this.
I've seen a maximum of about 50GB of RAM used when running tests
like this, so running on a 64p/64GB VM the additional concurrency
doesn't really stress memory capacity like it does CPU and IO.

All the runners are executed in private mount namespaces. This is
to prevent ephemeral mount namespace clones from taking a reference
to every mounted filesystem in the machine and so causing random
"device busy after unmount" failures in the tests that are running
concurrently with the mount namespace setup and teardown.

A typical `pstree -N mnt` looks like:

$ pstree -N mnt
[4026531841]
bash
bash───pstree
[0]
sudo───sudo───check-parallel─┬─check-parallel───nsexec───check───311─┬─cut
                             │                                       └─md5sum
                             ├─check-parallel───nsexec───check───750─┬─750───sleep
                             │                                       └─750.fsstress───4*[750.fsstress───{750.fsstress}]
                             ├─check-parallel───nsexec───check───013───013───sed
                             ├─check-parallel───nsexec───check───251───cp
                             ├─check-parallel───nsexec───check───467───open_by_handle
                             ├─check-parallel───nsexec───check───650─┬─650───sleep
                             │                                       └─650.fsstress─┬─61*[650.fsstress───{650.fsstress}]
                             │                                                      └─2*[650.fsstress]
                             ├─check-parallel───nsexec───check───707
                             ├─check-parallel───nsexec───check───705
                             ├─check-parallel───nsexec───check───416
                             ├─check-parallel───nsexec───check───477───2*[open_by_handle]
                             ├─check-parallel───nsexec───check───140───140
                             ├─check-parallel───nsexec───check───562
                             ├─check-parallel───nsexec───check───415───xfs_io───{xfs_io}
                             ├─check-parallel───nsexec───check───291
                             ├─check-parallel───nsexec───check───017
                             ├─check-parallel───nsexec───check───016
                             ├─check-parallel───nsexec───check───168───2*[168───168]
                             ├─check-parallel───nsexec───check───672───2*[672───672]
                             ├─check-parallel───nsexec───check───170─┬─170───170───170
                             │                                       └─170───170
                             ├─check-parallel───nsexec───check───531───122*[t_open_tmpfiles]
                             ├─check-parallel───nsexec───check───387
                             ├─check-parallel───nsexec───check───748
                             ├─check-parallel───nsexec───check───388─┬─388.fsstress───4*[388.fsstress───{388.fsstress}]
                             │                                       └─sleep
                             ├─check-parallel───nsexec───check───328───328
                             ├─check-parallel───nsexec───check───352
                             ├─check-parallel───nsexec───check───042
                             ├─check-parallel───nsexec───check───426───open_by_handle
                             ├─check-parallel───nsexec───check───756───2*[open_by_handle]
                             ├─check-parallel───nsexec───check───227
                             ├─check-parallel───nsexec───check───208───aio-dio-invalid───2*[aio-dio-invalid]
                             ├─check-parallel───nsexec───check───746───cp
                             ├─check-parallel───nsexec───check───187───187
                             ├─check-parallel───nsexec───check───027───8*[027]
                             ├─check-parallel───nsexec───check───045───xfs_io───{xfs_io}
                             ├─check-parallel───nsexec───check───044
                             ├─check-parallel───nsexec───check───204
                             ├─check-parallel───nsexec───check───186───186
                             ├─check-parallel───nsexec───check───449
                             ├─check-parallel───nsexec───check───231───su───fsx
                             ├─check-parallel───nsexec───check───509
                             ├─check-parallel───nsexec───check───127───5*[127───fsx]
                             ├─check-parallel───nsexec───check───047
                             ├─check-parallel───nsexec───check───043
                             ├─check-parallel───nsexec───check───475───pkill
                             ├─check-parallel───nsexec───check───299─┬─fio─┬─4*[fio]
                             │                                       │     ├─2*[fio───4*[{fio}]]
                             │                                       │     └─{fio}
                             │                                       └─pgrep
                             ├─check-parallel───nsexec───check───551───aio-dio-write-v
                             ├─check-parallel───nsexec───check───323───aio-last-ref-he───100*[{aio-last-ref-he}]
                             ├─check-parallel───nsexec───check───648───sleep
                             ├─check-parallel───nsexec───check───046
                             ├─check-parallel───nsexec───check───753─┬─753.fsstress───4*[753.fsstress]
                             │                                       └─pkill
                             ├─check-parallel───nsexec───check───507───507
                             ├─check-parallel───nsexec───check───629─┬─3*[629───xfs_io───{xfs_io}]
                             │                                       └─5*[629]
                             ├─check-parallel───nsexec───check───073───umount
                             ├─check-parallel───nsexec───check───615───615
                             ├─check-parallel───nsexec───check───176───punch-alternati
                             ├─check-parallel───nsexec───check───294
                             ├─check-parallel───nsexec───check───236───236
                             ├─check-parallel───nsexec───check───165─┬─165─┬─165─┬─cut
                             │                                       │     │     └─xfs_io───{xfs_io}
                             │                                       │     └─165───grep
                             │                                       └─165
                             ├─check-parallel───nsexec───check───259───sync
                             ├─check-parallel───nsexec───check───442───442.fsstress───4*[442.fsstress───{442.fsstress}]
                             ├─check-parallel───nsexec───check───558───255*[558]
                             ├─check-parallel───nsexec───check───358───358───358
                             ├─check-parallel───nsexec───check───169───169
                             └─check-parallel───nsexec───check───297─┬─297.fsstress─┬─284*[297.fsstress───{297.fsstress}]
                                                                     │              └─716*[297.fsstress]
                                                                     └─sleep

A typical test run looks like:

$ time sudo ./check-parallel /mnt/xfs -s xfs -x dump
Runner 63 Failures: xfs/170
Runner 36 Failures: xfs/050
Runner 30 Failures: xfs/273
Runner 29 Failures: generic/135
Runner 25 Failures: generic/603
Tests run: 1140
Failure count: 5

Ten slowest tests - runtime in seconds:
xfs/013 454
generic/707 414
generic/017 398
generic/387 395
generic/748 390
xfs/140 351
generic/562 351
generic/705 347
generic/251 344
xfs/016 343

Cleanup on Aisle 5?

total 0
crw-------. 1 root root 10, 236 Nov 27 09:27 control
lrwxrwxrwx. 1 root root       7 Nov 27 09:27 fast -> ../dm-0
/dev/mapper/fast  1.4T  192G  1.2T  14% /mnt/xfs

real	9m29.056s
user	0m0.005s
sys	0m0.022s
$

Yeah, that runtime is real - under 10 minutes for a full XFS auto
group test run. When running this normally (i.e. via check) on this
machine, it usually takes just under 4 hours to run the same set of
tests. That is, I can run ./check-parallel roughly 25 times on this
machine in the same time it takes to run ./check once.
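
For illustration only, the way the test list is divided between the
runner threads can be sketched as a round-robin distribution. This is
a simplified, standalone sketch and not the code in the patch: the
function name split_round_robin, its argument order and the sample
test names are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: deal a list of tests out to N runners in
# round-robin order, so that test i lands on runner (i % N).
split_round_robin()
{
	local nr_runners=$1
	shift
	local -a tests=( "$@" )
	local -a buckets=()
	local ix

	# Append each test to the bucket of the runner it maps to.
	for ((ix = 0; ix < ${#tests[@]}; ix++)); do
		buckets[ix % nr_runners]+="${tests[ix]} "
	done

	# Print one line per runner, trimming the trailing space.
	for ((ix = 0; ix < ${#buckets[@]}; ix++)); do
		echo "runner $ix: ${buckets[ix]% }"
	done
}

split_round_robin 3 generic/001 generic/002 xfs/001 xfs/002 xfs/003
```

With three runners and five tests, runner 0 gets the first and fourth
test, runner 1 the second and fifth, and runner 2 the third. When the
input list is ordered longest-running first, dealing tests out this
way gives each runner a similar mix of long and short tests.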

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 check          |   7 +-
 check-parallel | 205 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 208 insertions(+), 4 deletions(-)
 create mode 100755 check-parallel

diff --git a/check b/check
index 8131f4e2e..607d2456e 100755
--- a/check
+++ b/check
@@ -33,7 +33,7 @@ exclude_tests=()
 _err_msg=""
 
 # start the initialisation work now
-iam=check
+iam=check.$$
 
 # mkfs.xfs uses the presence of both of these variables to enable formerly
 # supported tiny filesystem configurations that fstests use for fuzz testing
@@ -460,7 +460,7 @@ fi
 
 _wrapup()
 {
-	seq="check"
+	seq="check.$$"
 	check="$RESULT_BASE/check"
 	$interrupt && sect_stop=`_wallclock`
 
@@ -552,7 +552,6 @@ _wrapup()
 
 	sum_bad=`expr $sum_bad + ${#bad[*]}`
 	_wipe_counters
-	rm -f /tmp/*.rawout /tmp/*.out /tmp/*.err /tmp/*.time
 	if ! $OPTIONS_HAVE_SECTIONS; then
 		rm -f $tmp.*
 	fi
@@ -808,7 +807,7 @@ function run_section()
 
 	init_rc
 
-	seq="check"
+	seq="check.$$"
 	check="$RESULT_BASE/check"
 
 	# don't leave old full output behind on a clean run
diff --git a/check-parallel b/check-parallel
new file mode 100755
index 000000000..c85437252
--- /dev/null
+++ b/check-parallel
@@ -0,0 +1,205 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024 Red Hat, Inc. All Rights Reserved.
+#
+# Run all tests in parallel
+#
+# This is a massive resource bomb script. For every test, it creates a
+# pair of sparse loop devices for test and scratch devices, then mount points
+# for them and runs the test in the background. When it completes, it tears down
+# the loop devices.
+
+export SRC_DIR="tests"
+basedir=$1
+shift
+check_args="$*"
+runners=64
+runner_list=()
+runtimes=()
+
+
+# tests in auto group
+test_list=$(awk '/^[0-9].*auto/ { print "generic/" $1 }' tests/generic/group.list)
+test_list+=$(awk '/^[0-9].*auto/ { print "xfs/" $1 }' tests/xfs/group.list)
+
+# grab all previously run tests and order them from highest runtime to lowest
+# We are going to try to run the longer tests first, hopefully so we can avoid
+# massive thundering herds trying to run lots of really short tests in parallel
+# right off the bat. This will also tend to vary the order of tests from run to
+# run somewhat.
+#
+# If we have tests in the test list that don't have runtimes recorded, then
+# append them to be run last.
+
+build_runner_list()
+{
+	local runtimes
+	local run_list=()
+	local prev_results=`ls -tr $basedir/runner-0/ | grep results | tail -1`
+
+	runtimes=$(cat $basedir/*/$prev_results/check.time | sort -k 2 -nr | cut -d " " -f 1)
+
+	# Iterate the timed list first. For every timed list entry that
+	# is found in the test_list, add it to the local runner list.
+	local -a _list=( $runtimes )
+	local -a _tlist=( $test_list )
+	local rx=0
+	local ix
+	local jx
+	#set -x
+	for ((ix = 0; ix < ${#_list[*]}; ix++)); do
+		echo $test_list | grep -q ${_list[$ix]}
+		if [ $? == 0 ]; then
+			# add the test to the new run list and remove
+			# it from the remaining test list.
+			run_list[rx++]=${_list[$ix]}
+			_tlist=( ${_tlist[*]/${_list[$ix]}/} )
+		fi
+
+	done
+
+	# The final test list is all the time ordered tests followed by
+	# all the tests we didn't find time records for.
+	test_list="${run_list[*]} ${_tlist[*]}"
+}
+
+if [ -f $basedir/runner-0/results/check.time ]; then
+	build_runner_list
+fi
+
+# split the list amongst N runners
+
+split_runner_list()
+{
+	local ix
+	local rx
+	local -a _list=( $test_list )
+	for ((ix = 0; ix < ${#_list[*]}; ix++)); do
+		seq="${_list[$ix]}"
+		rx=$((ix % $runners))
+		runner_list[$rx]+="${_list[$ix]} "
+		#echo $seq
+	done
+}
+
+_create_loop_device()
+{
+	local file=$1 dev
+
+	dev=`losetup -f --show $file` || _fail "Cannot assign $file to a loop device"
+
+	# Using buffered IO for the loop devices seems to run quite a bit
+	# faster. There are a lot of tests that hit the same regions of the
+	# filesystems, so avoiding read IO seems to really help. Results can
+	# vary, though, because many tests drop all caches unconditionally.
+	# Uncomment to use AIO+DIO loop devices instead.
+	#test -b "$dev" && losetup --direct-io=on $dev 2> /dev/null
+
+	echo $dev
+}
+
+_destroy_loop_device()
+{
+	local dev=$1
+	blockdev --flushbufs $dev
+	umount $dev > /dev/null 2>&1
+	losetup -d $dev || _fail "Cannot destroy loop device $dev"
+}
+
+runner_go()
+{
+	local id=$1
+	local me=$basedir/runner-$id
+	local _test=$me/test.img
+	local _scratch=$me/scratch.img
+	local _results=$me/results-$2
+
+	mkdir -p $me
+
+	xfs_io -f -c 'truncate 2g' $_test
+	xfs_io -f -c 'truncate 8g' $_scratch
+
+	mkfs.xfs -f $_test > /dev/null 2>&1
+
+	export TEST_DEV=$(_create_loop_device $_test)
+	export TEST_DIR=$me/test
+	export SCRATCH_DEV=$(_create_loop_device $_scratch)
+	export SCRATCH_MNT=$me/scratch
+	export FSTYP=xfs
+	export RESULT_BASE=$_results
+
+	mkdir -p $TEST_DIR
+	mkdir -p $SCRATCH_MNT
+	mkdir -p $RESULT_BASE
+	rm -f $RESULT_BASE/check.*
+
+#	export DUMP_CORRUPT_FS=1
+
+	# Run the tests in their own mount namespace, as per the comment below
+	# that precedes making the basedir a private mount.
+	./src/nsexec -m ./check $check_args -x unreliable_in_parallel --exact-order ${runner_list[$id]} > $me/log 2>&1
+
+	wait
+	sleep 1
+	umount -R $TEST_DIR 2> /dev/null
+	umount -R $SCRATCH_MNT 2> /dev/null
+	_destroy_loop_device $TEST_DEV
+	_destroy_loop_device $SCRATCH_DEV
+
+	grep -q Failures: $me/log
+	if [ $? -eq 0 ]; then
+		echo -n "Runner $id Failures: "
+		grep Failures: $me/log | uniq | sed -e "s/^.*Failures://"
+	fi
+
+}
+
+cleanup()
+{
+	killall -INT -q check
+	wait
+	umount -R $basedir/*/test 2> /dev/null
+	umount -R $basedir/*/scratch 2> /dev/null
+	losetup --detach-all
+}
+
+trap "cleanup; exit" HUP INT QUIT TERM
+
+
+# Each parallel test runner needs to only see its own mount points. If we
+# leave the basedir as shared, then all tests see all mounts and then we get
+# mount propagation issues cropping up. For example, cloning a new mount
+# namespace will take a reference to all visible shared mounts and hold them
+# while the mount namespace is active. This can cause unmount in the test that
+# controls the mount to succeed without actually unmounting the filesystem
+# because a mount namespace still holds a reference to it. This causes other
+# operations on the block device to fail as it is still busy (e.g. fsck, mkfs,
+# etc). Hence we make the basedir private here and then run each check instance
+# in its own mount namespace so that they cannot see mounts that other tests
+# are performing.
+mount --make-private $basedir
+split_runner_list
+now=`date +%Y-%m-%d-%H:%M:%S`
+for ((i = 0; i < $runners; i++)); do
+
+	runner_go $i $now &
+
+done;
+wait
+
+echo -n "Tests run: "
+grep Ran /mnt/xfs/*/log | sed -e 's,^.*:,,' -e 's, ,\n,g' | sort | uniq | wc -l
+
+echo -n "Failure count: "
+grep Failures: $basedir/*/log | uniq | sed -e "s/^.*Failures://" -e "s,\([0-9]\) \([gx]\),\1\n \2,g" |wc -l
+echo
+
+echo Ten slowest tests - runtime in seconds:
+cat $basedir/*/results/check.time | sort -k 2 -nr | head -10
+
+echo
+echo Cleanup on Aisle 5?
+echo
+losetup --list
+ls -l /dev/mapper
+df -h |grep xfs
-- 
2.45.2