On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote: > > > > > Agreed, though at some point after these bugfixes are merged I'll see if > > > > > I can build on the existing "if you have systemd then ___ else here's > > > > > your shabby opencoded version" logic in fstests to isolate the ./checks > > > > > from each other a little better. It'd be kinda nice if we could > > > > > actually just put them in something resembling a modernish container, > > > > > albeit with the same underlying fs. > > > > > > > > Agreed, but I don't think we need to depend on systemd for that, > > > > either. > > > > > > > > > <shrug> Anyone else interested in that? > > > > > > > > check-parallel has already started down that road with the > > > > mount namespace isolation it uses for the runner tasks via > > > > src/nsexec.c. > ..... > > > > Hmmm - looks like src/nsexec.c can create new PID namespaces via > > > > the "-p" option. I haven't used that before - I wonder if that's a > > > > better solution that using per-test session IDs to solve the pkill > > > > --parent problem? Something to look into in the morning.... > > ..... > > > Note, however, that ps will show all processes in both the parent > > and the child namespace the shell is running on because the contents > > of /proc are the same for both. > > > > However, because we are also using private mount namespaces for the > > check process, pid_namespaces(7) tells us: > > > > /proc and PID namespaces > > > > A /proc filesystem shows (in the /proc/pid directories) only > > processes visible in the PID namespace of the process that > > performed the mount, even if the /proc filesystem is viewed > > from processes in other namespaces. > > > > After creating a new PID namespace, it is useful for the > > child to change its root directory and mount a new procfs > > instance at /proc so that tools such as ps(1) work > > >>> correctly. If a new mount namespace is simultaneously > > >>> created by including CLONE_NEWNS in the flags argument of > > >>> clone(2) or unshare(2), then it isn't necessary to change the > > >>> root directory: a new procfs instance can be mounted directly > > >>> over /proc. > > > > From a shell, the command to mount /proc is: > > > > $ mount -t proc proc /proc > > > > Calling readlink(2) on the path /proc/self yields the process > > ID of the caller in the PID namespace of the procfs mount > > (i.e., the PID name‐ space of the process that mounted the > > procfs). This can be useful for introspection purposes, when > > a process wants to discover its PID in other namespaces. > > > > This appears to give us an environment that only shows the processes > > within the current PID namespace: > > > > $ sudo src/nsexec -p -m bash > > # mount -t proc proc /proc > > # ps waux > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > > root 1 0.0 0.0 7384 3744 pts/1 S 11:55 0:00 bash > > root 72 0.0 0.0 8300 3736 pts/1 R+ 12:04 0:00 ps waux > > # pstree -N pid > > [4026538173] > > bash───pstree > > # > > > > Yup, there we go - we have full PID isolation for this shell. > > Ok, it took me some time to get this to work reliably - the way was > full of landmines with little documentation to guide around them. > > 1. If multiple commands are needed to be run from nsexec, a helper > script is needed (I called it check-helper). > > 2. You have to `mount --make-private /proc` before doing anything to > make the mounts of /proc inside the pid namespace work correctly. > If you don't do this, every other mount namespace will also see > only the newest mount, and that means every runner but the last > one to start with the wrong mount and PID namespace view in /proc. > > 3. if you get the system into the state described by 1), unmounting > /proc in each runner then races to remove the top /proc mount. > Many threads get -EBUSY failures from unmount, leaving many stale > mounts of /proc behind. > > 4. the top level shell where check-parallel was invoked is left with > the view where /proc has been unmounted from a PID/mount > namespace that has gone away. Hence /proc now has no processes or > mounts being reported and nothing works until you mount a new > instance /proc in that namespace. > > 5. After mounting proc again there are lots of mounts of stale /proc > mounts still reported. They cannot be unmounted as the mount > namespaces that created them have gone away and unmounting /proc > in the current shell simply removes the last mounted one and we > goto 4). > > 4. /tmp is still shared across all runner instances so all the > > concurrent runners dump all their tmp files in /tmp. However, the > runners no longer have unique PIDs (i.e. check runs as PID 3 in > all runner instaces). This means using /tmp/tmp.$$ as the > check/test temp file definition results is instant tmp file name > collisions and random things in check and tests fail. check and > common/preamble have to be converted to use `mktemp` to provide > unique tempfile name prefixes again. > > 5. Don't forget to revert the parent /proc mount back to shared > after check has finished running (or was aborted). > > I think with this (current prototype patch below), we can use PID > namespaces rather than process session IDs for check-parallel safe > process management. > > Thoughts? Was about to go to bed, but can we also start a new mount namespace, create a private (or at least non-global) /tmp to put files into, and then each test instance is isolated from clobbering the /tmpfiles of other ./check instances *and* the rest of the system? --D > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx > > > check-parallel: use PID namespaces for runner process isolation > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > This provides isolation between individual runners so that they > cannot see the processes that other test runners have created. > This means tools like pkill will only find processes run by the test > that is calling it, hence there is no danger taht it might kill > processes owned by a different test in a different runner context. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > check | 6 +++++- > check-helper | 23 +++++++++++++++++++++++ > check-parallel | 25 +++++++++++++++++++------ > common/preamble | 5 ++++- > 4 files changed, 51 insertions(+), 8 deletions(-) > > diff --git a/check b/check > index 4dc266dcf..314436667 100755 > --- a/check > +++ b/check > @@ -4,7 +4,11 @@ > # > # Control script for QA > # > -tmp=/tmp/$$ > + > +# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier. > +# Use mktemp instead. > +tmp=`mktemp` > + > status=0 > needwrap=true > needsum=true > diff --git a/check-helper b/check-helper > new file mode 100755 > index 000000000..47a92de8b > --- /dev/null > +++ b/check-helper > @@ -0,0 +1,23 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (c) 2025 Red Hat, Inc. All Rights Reserved. > +# > + > +# When check is run from check-parallel, it is run in private mount and PID > +# namespaces. We need to set up /proc to only show processes in this PID > +# namespace, so we mount a new instance of /proc over the top of the existing > +# version. Because we are in a private mount namespace, the does not propagate > +# outside this context and hence does not conflict with the other test runners > +# that are performing the same setup actions. > + > +args="$@" > + > +#echo $args > +mount -t proc proc /proc > +ret=$? > +if [ $ret -eq 0 ]; then > + ./check $args > + umount -l /proc > +else > + echo failed to mount /proc, ret $ret! > +fi > diff --git a/check-parallel b/check-parallel > index 2fbf0fdbe..6082baf5e 100755 > --- a/check-parallel > +++ b/check-parallel > @@ -259,14 +259,14 @@ runner_go() > rm -f $RESULT_BASE/check.* > > # Only supports default mkfs parameters right now > - wipefs -a $TEST_DEV > /dev/null 2>&1 > - yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV > /dev/null 2>&1 > + wipefs -a $TEST_DEV > $me/log 2>&1 > + yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV >> $me/log 2>&1 > > # export DUMP_CORRUPT_FS=1 > > - # Run the tests in it's own mount namespace, as per the comment below > - # that precedes making the basedir a private mount. > - ./src/nsexec -m ./check $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} > $me/log 2>&1 > + # Run the tests in it's own mount and PID namespace, as per the comment > + # below that precedes making the basedir a private mount. > + ./src/nsexec -m -p ./check-helper $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} >> $me/log 2>&1 > > wait > sleep 1 > @@ -291,6 +291,8 @@ cleanup() > umount -R $basedir/*/test 2> /dev/null > umount -R $basedir/*/scratch 2> /dev/null > losetup --detach-all > + mount --make-shared /proc > + mount --make-shared $basedir > } > > trap "cleanup; exit" HUP INT QUIT TERM > @@ -311,10 +313,17 @@ fi > # controls the mount to succeed without actually unmounting the filesytsem > # because a mount namespace still holds a reference to it. This causes other > # operations on the block device to fail as it is still busy (e.g. fsck, mkfs, > -# etc). Hence we make the basedir private here and then run each check instance > +# etc). > +# > +# Hence we make the basedir private here and then run each check instance > # in it's own mount namespace so that they cannot see mounts that other tests > # are performing. > +# > +# We also need to make /proc private so that the runners can be run cleanly in > +# a PID namespace. This requires an new mount of /proc inside the PID namespace, > +# and this requires a private mount namespace to work correctly. > mount --make-private $basedir > +mount --make-private /proc > > now=`date +%Y-%m-%d-%H:%M:%S` > for ((i = 0; i < $runners; i++)); do > @@ -328,6 +337,10 @@ for ((i = 0; i < $runners; i++)); do > done; > wait > > +# Restore the parents back to shared mount namespace behaviour. > +mount --make-shared /proc > +mount --make-shared $basedir > + > if [ -n "$show_test_list" ]; then > exit 0 > fi > diff --git a/common/preamble b/common/preamble > index 78e45d522..8f47b172a 100644 > --- a/common/preamble > +++ b/common/preamble > @@ -43,8 +43,11 @@ _begin_fstest() > seqres=$RESULT_DIR/$seq > echo "QA output created by $seq" > > + # When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier. > + # Use mktemp instead. > + tmp=`mktemp` > + > here=`pwd` > - tmp=/tmp/$$ > status=1 # failure is the default! > > _register_cleanup _cleanup