Re: [PATCH 13/23] generic/650: revert SOAK DURATION changes

On Tue, Jan 21, 2025 at 07:49:44PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 21, 2025 at 03:57:23PM +1100, Dave Chinner wrote:
> > On Thu, Jan 16, 2025 at 03:28:33PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Prior to commit 8973af00ec21, in the absence of an explicit
> > > SOAK_DURATION, this test would run 2500 fsstress operations each of ten
> > > times through the loop body.  On the author's machines, this kept the
> > > runtime to about 30s total.  Oddly, this was changed to 30s per loop
> > > body with no specific justification in the middle of an fsstress process
> > > management change.
> > 
> > I'm pretty sure that was because when you run g/650 on a machine
> > with 64p, the number of ops performed on the filesystem is
> > nr_cpus * 2500 * nr_loops.
> 
> Where does that happen?
> 
> Oh, heh.  -n is the number of ops *per process*.

Yeah, I just noticed another case of this:

Ten slowest tests - runtime in seconds:
generic/750 559
generic/311 486
.....

generic/750 does:

nr_cpus=$((LOAD_FACTOR * 4))
nr_ops=$((25000 * nr_cpus * TIME_FACTOR))
fsstress_args=(-w -d $SCRATCH_MNT -n $nr_ops -p $nr_cpus)

So the total op count actually scales quadratically with the load
factor (nr_cpus and nr_ops both grow linearly, and the total is their
product):

Load factor	nr_cpus		nr_ops		total ops
1		4		100k		400k
2		8		200k		1.6M
3		12		300k		3.6M
4		16		400k		6.4M

and so on.
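Spelling the scaling out (a quick sketch of the generic/750 math above;
remember that fsstress -n is per process, so the total is nr_cpus *
nr_ops):

```shell
# Sketch: reproduce the table above. -n is per process, so the
# total op count is nr_cpus * nr_ops, i.e. quadratic in LOAD_FACTOR.
TIME_FACTOR=1
for LOAD_FACTOR in 1 2 3 4; do
	nr_cpus=$((LOAD_FACTOR * 4))
	nr_ops=$((25000 * nr_cpus * TIME_FACTOR))
	echo "LOAD_FACTOR=$LOAD_FACTOR total_ops=$((nr_cpus * nr_ops))"
done
```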

I suspect that there are other similar CPU scaling issues
lurking across the many fsstress tests...

> > > On the author's machine, this explodes the runtime from ~30s to 420s.
> > > Put things back the way they were.
> > 
> > Yeah, OK, that's exactly what keep_running() does - duration
> > overrides nr_ops.
> > 
> > Ok, so keeping or reverting the change will simply make different
> > people unhappy because of the excessive runtime the test has at
> > either ends of the CPU count spectrum - what's the best way to go
> > about providing the desired min(nr_ops, max loop time) behaviour?
> > Do we simply cap the maximum process count to keep the number of ops
> > down to something reasonable (e.g. 16), or something else?
> 
> How about running fsstress with --duration=3 if SOAK_DURATION isn't set?
> That should keep the runtime to 30 seconds or so even on larger
> machines:
> 
> if [ -n "$SOAK_DURATION" ]; then
> 	test "$SOAK_DURATION" -lt 10 && SOAK_DURATION=10
> 	fsstress_args+=(--duration="$((SOAK_DURATION / 10))")
> else
> 	# run for 3s per iteration max for a default runtime of ~30s.
> 	fsstress_args+=(--duration=3)
> fi

Yeah, that works for me.
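Just to sanity-check the math in that hunk (a quick sketch; 10 here is
the g/650 loop count):

```shell
# Sketch: with the hunk above, each of g/650's ten loop iterations
# gets a tenth of SOAK_DURATION, and the clamp to 10 guarantees the
# per-iteration duration is at least 1s.
SOAK_DURATION=300
if [ "$SOAK_DURATION" -lt 10 ]; then
	SOAK_DURATION=10
fi
echo "--duration=$((SOAK_DURATION / 10))"
```

So SOAK_DURATION=300 gives --duration=30 per iteration, i.e. ~300s
total, and anything under 10s still gets --duration=1.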

As a rainy day project, perhaps we should look at converting all the
fsstress invocations to be time bound rather than running a specific
number of ops, i.e. hard-code nr_ops=<some huge number> in
_run_fsstress_bg() so the tests only need to define parallelism and
runtime.
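A rough sketch of what that could look like (the function and argument
names here are made up for illustration, not the actual fstests API):

```shell
# Hypothetical sketch: tests supply only parallelism and runtime;
# -n is pinned huge so --duration is always the effective limit.
_fsstress_time_bound_args()
{
	local nr_procs="$1" duration="$2"
	echo "-p $nr_procs -n 100000000 --duration=$duration"
}

# e.g. a test body would then reduce to something like:
fsstress_args="$(_fsstress_time_bound_args 4 30)"
echo "$fsstress_args"
```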

This would make the test runtimes more deterministic across machines
with vastly different capabilities, and it would largely make "test xyz
is slow on my test machine" reports go away.

Thoughts?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



