Re: [PATCH 08/23] common: fix pkill by running test program in a separate session

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 28 Jan 2025 15:34:50 +1100

On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote:
> > > > Agreed, though at some point after these bugfixes are merged I'll see if
> > > > I can build on the existing "if you have systemd then ___ else here's
> > > > your shabby opencoded version" logic in fstests to isolate the ./checks
> > > > from each other a little better.  It'd be kinda nice if we could
> > > > actually just put them in something resembling a modernish container,
> > > > albeit with the same underlying fs.
> > > 
> > > Agreed, but I don't think we need to depend on systemd for that,
> > > either.
> > > 
> > > > <shrug> Anyone else interested in that?
> > > 
> > > check-parallel has already started down that road with the
> > > mount namespace isolation it uses for the runner tasks via
> > > src/nsexec.c.
.....
> > > Hmmm - looks like src/nsexec.c can create new PID namespaces via
> > > the "-p" option. I haven't used that before - I wonder if that's a
> > > better solution that using per-test session IDs to solve the pkill
> > > --parent problem? Something to look into in the morning....

.....

> Note, however, that ps will show all processes in both the parent
> and the child namespace the shell is running on because the contents
> of /proc are the same for both.
> 
> However, because we are also using private mount namespaces for the
> check process, pid_namespaces(7) tells us:
> 
> /proc and PID namespaces
> 
>        A /proc filesystem shows (in the /proc/pid directories) only
>        processes visible in the PID namespace of the process that
>        performed the mount, even if the /proc filesystem is viewed
>        from processes in other namespaces.
> 
>        After creating a new PID namespace, it is useful for the
>        child to change its root directory and mount a new procfs
>        instance at /proc so that tools  such  as  ps(1) work
> >>>    correctly.  If a new mount namespace is simultaneously
> >>>    created by including CLONE_NEWNS in the flags argument of
> >>>    clone(2) or unshare(2), then it isn't necessary to change the
> >>>    root directory: a new procfs instance can be mounted directly
> >>>    over /proc.
> 
>        From a shell, the command to mount /proc is:
> 
>            $ mount -t proc proc /proc
> 
>        Calling readlink(2) on the path /proc/self yields the process
>        ID of the caller in the PID namespace of the procfs mount
>        (i.e., the PID name‐ space of the process that mounted the
>        procfs).  This can be useful for introspection purposes, when
>        a process wants to discover its  PID  in other namespaces.
> 
> This appears to give us an environment that only shows the processes
> within the current PID namespace:
> 
> $ sudo src/nsexec -p -m bash
> # mount -t proc proc /proc
> # ps waux
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root           1  0.0  0.0   7384  3744 pts/1    S    11:55   0:00 bash
> root          72  0.0  0.0   8300  3736 pts/1    R+   12:04   0:00 ps waux
> # pstree -N pid
> [4026538173]
> bash───pstree
> #
> 
> Yup, there we go - we have full PID isolation for this shell.

Ok, it took me some time to get this to work reliably - the way was
full of landmines with little documentation to guide around them.

1. If multiple commands are needed to be run from nsexec, a helper
   script is needed (I called it check-helper).

2. You have to `mount --make-private /proc` before doing anything to
   make the mounts of /proc inside the pid namespace work correctly.
   If you don't do this, every other mount namespace will also see
   only the newest mount, and that means every runner but the last
   one to start with the wrong mount and PID namespace view in /proc.

3. if you get the system into the state described by 1), unmounting
   /proc in each runner then races to remove the top /proc mount.
   Many threads get -EBUSY failures from unmount, leaving many stale
   mounts of /proc behind.

4. the top level shell where check-parallel was invoked is left with
   the view where /proc has been unmounted from a PID/mount
   namespace that has gone away. Hence /proc now has no processes or
   mounts being reported and nothing works until you mount a new
   instance /proc in that namespace.

5. After mounting proc again there are lots of mounts of stale /proc
   mounts still reported. They cannot be unmounted as the mount
   namespaces that created them have gone away and unmounting /proc
   in the current shell simply removes the last mounted one and we
   goto 4).

4. /tmp is still shared across all runner instances so all the

   concurrent runners dump all their tmp files in /tmp. However, the
   runners no longer have unique PIDs (i.e. check runs as PID 3 in
   all runner instaces). This means using /tmp/tmp.$$ as the
   check/test temp file definition results is instant tmp file name
   collisions and random things in check and tests fail.  check and
   common/preamble have to be converted to use `mktemp` to provide
   unique tempfile name prefixes again.

5. Don't forget to revert the parent /proc mount back to shared
   after check has finished running (or was aborted).

I think with this (current prototype patch below), we can use PID
namespaces rather than process session IDs for check-parallel safe
process management.

Thoughts?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


check-parallel: use PID namespaces for runner process isolation

From: Dave Chinner <dchinner@xxxxxxxxxx>

This provides isolation between individual runners so that they
cannot see the processes that other test runners have created.
This means tools like pkill will only find processes run by the test
that is calling it, hence there is no danger taht it might kill
processes owned by a different test in a different runner context.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 check           |  6 +++++-
 check-helper    | 23 +++++++++++++++++++++++
 check-parallel  | 25 +++++++++++++++++++------
 common/preamble |  5 ++++-
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/check b/check
index 4dc266dcf..314436667 100755
--- a/check
+++ b/check
@@ -4,7 +4,11 @@
 #
 # Control script for QA
 #
-tmp=/tmp/$$
+
+# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier.
+# Use mktemp instead.
+tmp=`mktemp`
+
 status=0
 needwrap=true
 needsum=true
diff --git a/check-helper b/check-helper
new file mode 100755
index 000000000..47a92de8b
--- /dev/null
+++ b/check-helper
@@ -0,0 +1,23 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025 Red Hat, Inc.  All Rights Reserved.
+#
+
+# When check is run from check-parallel, it is run in private mount and PID
+# namespaces. We need to set up /proc to only show processes in this PID
+# namespace, so we mount a new instance of /proc over the top of the existing
+# version. Because we are in a private mount namespace, the does not propagate
+# outside this context and hence does not conflict with the other test runners
+# that are performing the same setup actions.
+
+args="$@"
+
+#echo $args
+mount -t proc proc /proc
+ret=$?
+if [ $ret -eq 0 ]; then
+	./check $args
+	umount -l /proc
+else
+	echo failed to mount /proc, ret $ret!
+fi
diff --git a/check-parallel b/check-parallel
index 2fbf0fdbe..6082baf5e 100755
--- a/check-parallel
+++ b/check-parallel
@@ -259,14 +259,14 @@ runner_go()
 	rm -f $RESULT_BASE/check.*
 
 	# Only supports default mkfs parameters right now
-	wipefs -a $TEST_DEV > /dev/null 2>&1
-	yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV > /dev/null 2>&1
+	wipefs -a $TEST_DEV > $me/log 2>&1
+	yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV >> $me/log 2>&1
 
 #	export DUMP_CORRUPT_FS=1
 
-	# Run the tests in it's own mount namespace, as per the comment below
-	# that precedes making the basedir a private mount.
-	./src/nsexec -m ./check $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} > $me/log 2>&1
+	# Run the tests in it's own mount and PID namespace, as per the comment
+	# below that precedes making the basedir a private mount.
+	./src/nsexec -m -p ./check-helper $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} >> $me/log 2>&1
 
 	wait
 	sleep 1
@@ -291,6 +291,8 @@ cleanup()
 	umount -R $basedir/*/test 2> /dev/null
 	umount -R $basedir/*/scratch 2> /dev/null
 	losetup --detach-all
+	mount --make-shared /proc
+	mount --make-shared $basedir
 }
 
 trap "cleanup; exit" HUP INT QUIT TERM
@@ -311,10 +313,17 @@ fi
 # controls the mount to succeed without actually unmounting the filesytsem
 # because a mount namespace still holds a reference to it. This causes other
 # operations on the block device to fail as it is still busy (e.g. fsck, mkfs,
-# etc). Hence we make the basedir private here and then run each check instance
+# etc).
+#
+# Hence we make the basedir private here and then run each check instance
 # in it's own mount namespace so that they cannot see mounts that other tests
 # are performing.
+#
+# We also need to make /proc private so that the runners can be run cleanly in
+# a PID namespace. This requires an new mount of /proc inside the PID namespace,
+# and this requires a private mount namespace to work correctly.
 mount --make-private $basedir
+mount --make-private /proc
 
 now=`date +%Y-%m-%d-%H:%M:%S`
 for ((i = 0; i < $runners; i++)); do
@@ -328,6 +337,10 @@ for ((i = 0; i < $runners; i++)); do
 done;
 wait
 
+# Restore the parents back to shared mount namespace behaviour.
+mount --make-shared /proc
+mount --make-shared $basedir
+
 if [ -n "$show_test_list" ]; then
 	exit 0
 fi
diff --git a/common/preamble b/common/preamble
index 78e45d522..8f47b172a 100644
--- a/common/preamble
+++ b/common/preamble
@@ -43,8 +43,11 @@ _begin_fstest()
 	seqres=$RESULT_DIR/$seq
 	echo "QA output created by $seq"
 
+	# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier.
+	# Use mktemp instead.
+	tmp=`mktemp`
+
 	here=`pwd`
-	tmp=/tmp/$$
 	status=1	# failure is the default!
 
 	_register_cleanup _cleanup