On Wed, Jan 22, 2025 at 03:12:11PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 07:49:44PM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 21, 2025 at 03:57:23PM +1100, Dave Chinner wrote:
> > > On Thu, Jan 16, 2025 at 03:28:33PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > >
> > > > Prior to commit 8973af00ec21, in the absence of an explicit
> > > > SOAK_DURATION, this test would run 2500 fsstress operations on each
> > > > of its ten trips through the loop body.  On the author's machines,
> > > > this kept the runtime to about 30s total.  Oddly, this was changed
> > > > to 30s per loop body, with no specific justification, in the middle
> > > > of an fsstress process management change.
> > >
> > > I'm pretty sure that was because when you run g/650 on a machine
> > > with 64p, the number of ops performed on the filesystem is
> > > nr_cpus * 2500 * nr_loops.
> >
> > Where does that happen?
> >
> > Oh, heh.  -n is the number of ops *per process*.
>
> Yeah, I just noticed another case of this:
>
> Ten slowest tests - runtime in seconds:
> generic/750    559
> generic/311    486
> .....
>
> generic/750 does:
>
> nr_cpus=$((LOAD_FACTOR * 4))
> nr_ops=$((25000 * nr_cpus * TIME_FACTOR))
> fsstress_args=(-w -d $SCRATCH_MNT -n $nr_ops -p $nr_cpus)
>
> So the total work actually scales quadratically with the load factor:
>
> Load factor    nr_cpus    nr_ops    total ops
> 1              4          100k      400k
> 2              8          200k      1.6M
> 3              12         300k      3.6M
> 4              16         400k      6.4M
>
> and so on.
>
> I suspect that there are other similar cpu scaling issues
> lurking across the many fsstress tests...
>
> > > > On the author's machine, this explodes the runtime from ~30s to
> > > > 420s.  Put things back the way they were.
> > >
> > > Yeah, OK, that's exactly what keep_running() does - duration
> > > overrides nr_ops.
> > >
> > > Ok, so keeping or reverting the change will simply make different
> > > people unhappy because of the excessive runtime the test has at
> > > either end of the CPU count spectrum - what's the best way to go
> > > about providing the desired min(nr_ops, max loop time) behaviour?
> > > Do we simply cap the maximum process count to keep the number of
> > > ops down to something reasonable (e.g. 16), or something else?
> >
> > How about running fsstress with --duration=3 if SOAK_DURATION isn't
> > set?  That should keep the runtime to 30 seconds or so even on larger
> > machines:
> >
> > 	if [ -n "$SOAK_DURATION" ]; then
> > 		test "$SOAK_DURATION" -lt 10 && SOAK_DURATION=10
> > 		fsstress_args+=(--duration="$((SOAK_DURATION / 10))")
> > 	else
> > 		# run for 3s per iteration max for a default runtime of ~30s.
> > 		fsstress_args+=(--duration=3)
> > 	fi
>
> Yeah, that works for me.
>
> As a rainy day project, perhaps we should look to convert all the
> fsstress invocations to be time-bound rather than running a specific
> number of ops, i.e. hard code nr_ops=<some huge number> in
> _run_fsstress_bg() and the tests only need to define parallelism and
> runtime.

I /think/ the only ones that do that are generic/1220, generic/476,
generic/642, and generic/750.  I could drop the nr_cpus term from the
nr_ops calculation.

> This would make the test runtimes more deterministic across machines
> with vastly different capabilities, and would largely make "test xyz
> is slow on my test machine" reports go away.
>
> Thoughts?

I'm fine with _run_fsstress injecting --duration=30 if no other
duration argument is passed in.

--D

> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
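
As a rough illustration of the --duration=30 injection suggested above,
here's a minimal sketch of how a wrapper like _run_fsstress_bg could add a
default runtime bound when the caller doesn't supply one.  The helper body,
and the $FSSTRESS_PROG and _FSSTRESS_PID names, are assumptions modelled on
common/fsstress conventions, not the actual implementation:

	_run_fsstress_bg()
	{
		local args=("$@")
		local arg
		local has_duration=""

		# Did the caller already bound the runtime?
		for arg in "${args[@]}"; do
			case "$arg" in
			--duration*)
				has_duration=1
				;;
			esac
		done

		# No explicit runtime given: bound the run by time rather
		# than by op count so the default cost is roughly the same
		# on any machine.
		if [ -z "$has_duration" ]; then
			args+=(--duration=30)
		fi

		$FSSTRESS_PROG "${args[@]}" &
		_FSSTRESS_PID=$!
	}

With something along those lines in place, tests would only need to pick a
parallelism value (plus a huge -n) and could rely on the wrapper to cap the
runtime.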