Re: [PATCH 10/23] mkfs: don't hardcode log size

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 22 Jan 2025 09:05:18 +1100

On Tue, Jan 21, 2025 at 07:44:30AM -0500, Theodore Ts'o wrote:
> On Tue, Jan 21, 2025 at 02:58:25PM +1100, Dave Chinner wrote:
> > > +# Are there mkfs options to try to improve concurrency?
> > > +_scratch_mkfs_concurrency_options()
> > > +{
> > > +	local nr_cpus="$(( $1 * LOAD_FACTOR ))"
> > 
> > caller does not need to pass a number of CPUs. This function can
> > simply do:
> > 
> > 	local nr_cpus=$(getconf _NPROCESSORS_CONF)
> > 
> > And that will set concurrency to be "optimal" for the number of CPUs
> > in the machine the test is going to run on. That way tests don't
> > need to hard code some number that is going to be too large for
> > small systems and too small for large systems...
> 
> Hmm, but is this the right thing if you are using check-parallel?

Yes. The whole point of check-parallel is to run the tests in such a
way as to max out the resources of the test machine for the entire
test run. Everything that can be run concurrently should be run
concurrently, and we should not be cutting down on the concurrency
just because we are running check-parallel.
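
For reference, a minimal sketch of the getconf approach suggested in
the quoted review above. The function name sketch_nr_cpus is
hypothetical and this is not the fstests implementation; keeping the
LOAD_FACTOR multiplier from the original patch is an assumption, and it
defaults to 1 when unset:

```shell
# Hypothetical helper, not the fstests implementation: derive the CPU
# count from the machine instead of taking it as a caller argument.
sketch_nr_cpus()
{
	local nr_cpus
	nr_cpus=$(getconf _NPROCESSORS_CONF)

	# Assumption: LOAD_FACTOR still scales the result, as in the quoted
	# patch; treat it as 1 if it is unset.
	echo $(( nr_cpus * ${LOAD_FACTOR:-1} ))
}

sketch_nr_cpus
```

With something like this, no test has to hard code a CPU count; the
value follows whatever machine the test runs on.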

> If
> you are running multiple tests that are all running some kind of load
> or stress-testing antagonist at the same time, then having 3x to 5x
> the number of necessary antagonist threads is going to unnecessarily
> slow down the test run, which goes against the original goal of what
> we were hoping to achieve with check-parallel.

There are tests that run a thousand concurrent fsstress processes -
check-parallel still runs all those thousand fsstress processes.

> How many tests are you currently able to run in parallel today,

All of them if I wanted. Default is to run one test per CPU at a
time, but also to allow tests that use concurrency to maximise it.

> and
> what's the ultimate goal?

My initial goal was to maximise the utilisation of the machine when
testing XFS. If I can't max out a 64p server with 1.5 million
IOPS/7GB/s IO and 64GB RAM with check-parallel, then check-parallel
is not running enough tests in parallel.

Right now with 64 runner threads (one per CPU), I'm seeing an
average utilisation across the whole auto group XFS test run of:

- 50 CPUs
- 2.5GB/s IO @ 30k IOPS
- 40GB RAM
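
To put those averages in proportion to the machine capacity quoted
above (64 CPUs, 7GB/s, 1.5 million IOPS, 64GB RAM), here is a rough
back-of-the-envelope sketch. The pct helper is hypothetical and uses
only the figures in this mail:

```shell
# Rough utilisation percentages from the figures quoted in this mail.
# pct is a hypothetical helper, not part of fstests.
pct()
{
	awk -v used="$1" -v cap="$2" \
		'BEGIN { printf "%.1f%%\n", 100 * used / cap }'
}

echo "CPU:  $(pct 50 64)"
echo "IO:   $(pct 2.5 7)"
echo "IOPS: $(pct 30000 1500000)"
echo "RAM:  $(pct 40 64)"
```

CPU and RAM come out well above half of capacity, while IO bandwidth
and especially IOPS are much further from the hardware limits.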

The utilisation on ext4 is much lower and runtimes are much longer
for (as yet) unknown reasons. Concurrent fsstress loads, in
particular, appear to run much slower on ext4 than XFS...

> We could have some kind of antagonist load
> which is shared across multiple tests, but it's not clear to me that
> it's worth the complexity.

Yes, that's the plan further down the track - stuff like background
CPU hotplug (instead of a test that specifically runs hotplug with
fsstress that takes about 5 minutes to run), cache dropping to add
memory reclaim during tests, etc.

> (And note that it's not just fs and cpu
> load antagonists; there could also be memory stress antagonists, where
> having multiple antagonists could lead to OOM kills...)

Yes, I eventually plan to use the src/usemem.c memory locker to
create changing levels of background memory stress to the test
runs...

Right now "perturbations" are exercised as a side effect of random
tests performing these actions. I want to make them controllable by
check-parallel so we can exercise the system functionality across
the entire range of correctness tests we have, not just an isolated
test case.

IOWs, the whole point of check-parallel is to make use of large
machines to stress the whole OS at the same time as we are testing
for filesystem behavioural correctness.

I also want to do it in as short a time period as possible - outside
of dedicated QE environments, I don't believe that long running
stress tests actually provide value for the machine time they
consume. i.e. returns rapidly diminish because stress tests
cover 99.99% of the code paths they are going to exercise in the
first few minutes of running.

Yes, letting them run for longer will -eventually- cover rarely
travelled code paths, but for developers, CI systems and
first/second level QE verification of bug fixes we don't need
extended stress tests.

Further, when we run fstests in the normal way, we never cover
things like memory reclaim racing against unmount, freeze, sync,
etc. And we never cover them when the system is under extremely
heavy load running multiple GB/s of IO whilst CPU hotplug is running
whilst the scheduler is running at nearly a million context
switches/s, etc.

That's exactly the sort of loads that check-parallel is generating
on a machine just running the correctness tests in parallel. It
combines correctness testing with a dynamic, stressful environment,
and it runs the tests -fast-. The coverage I get in a single 10
minute auto-group run of check-parallel is -much higher- than I get
in a single auto-group run of check that takes 4 hours on the same
hardware to complete....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx