Re: Eric Whitney's ext4 scaling data

Eric Whitney <enwlinux@xxxxxxxxx> · Sun, 31 Mar 2013 23:43:52 -0400

* Dave Chinner <david@xxxxxxxxxxxxx>:
> On Wed, Mar 27, 2013 at 11:10:11AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 27, 2013 at 03:21:02PM +0800, Zheng Liu wrote:
> > > 
> > > The key issue with adding test cases to xfstests is that we need to
> > > handle some filesystem-specific features.  As we discussed with Dave,
> > > what is an extent?  IMHO xfstests is now getting more complicated
> > > because it needs to handle this problem, e.g. punch hole for
> > > indirect-based files in ext4.
> > 
> > Yes, that means among other things the test framework needs to keep
> > track of which file system features were being used when we run a
> > particular test, as well as the hardware configuration.
> > 
> > I suspect that what this means is that we're better off trying to
> > create a new test framework that does what we want, and automates as
> > much of this as possible.
> 
> Well, tracking the hardware, configuration, results over time, etc
> is really orthogonal to the benchmarking harness. We're already
> modifying xfstests to make it easier to do this sort of thing (like
> user specified results directories, configurable expunged files,
> etc) so that you can control and archive individual xfstests from a
> higher level automated harness.
> 
> So I don't see this as a problem that a low level benchmarking
> framework needs to concern itself directly with - what you seem to
> be wanting is a better automation and archiving framework on top of
> the low level harness that runs the specific tests/benchmarks....
> 
> > It would probably be a good idea to bring in Eric Whitney into this
> > discussion, since he has a huge amount of expertise about what sort of
> > things need to be done in order to get good results.  He was doing a
> > number of things by hand, including re-running the tests multiple
> > times to make sure the results were stable.  I could imagine that if
> > the framework could keep track of what the standard deviation was for
> > a particular test, it could try to do this automatically, and then we
> > could also throw up a flag if the average result hadn't changed, but
> > the standard deviation had increased, since that might be an
> > indication that some change had caused a lot more variability.
> 
> Yup, you need to have result archives and post-process them to do
> this sort of thing, which is why I think it's a separate problem to
> that of actually defining and running the benchmarks...
> 

I think it's important to also consider building good tools to explore
and visualize the data.  The web pages in the tar archive I sent Ted are a
poor approximation, since their content was generated by hand rather than
automatically.  Instead, you might have a tool whose user interface is a
web page with links to all collected data sets in an archive, and filters
which could be used to select specific test systems, individual benchmarks,
and metrics of interest (including configuration info).  Once you select a
group of data sets, test systems, a benchmark, and a metric, the page
produces graphs or tables of data for comparison.
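
As a rough illustration of the filter-then-compare idea described above,
here is a minimal Python sketch.  Everything in it is hypothetical: the
Result record, the select() and comparison_table() helpers, and the
placeholder numbers are invented for the example and do not come from any
existing tool or from real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One archived measurement (hypothetical record layout)."""
    dataset: str    # label for a collected data set, e.g. a run name
    system: str     # test system the data came from
    benchmark: str  # benchmark name
    metric: str     # metric name, e.g. "MB/s"
    value: float


def select(results, systems=None, benchmarks=None, metric=None):
    """Return the results matching every filter given (None = no filter)."""
    return [
        r for r in results
        if (systems is None or r.system in systems)
        and (benchmarks is None or r.benchmark in benchmarks)
        and (metric is None or r.metric == metric)
    ]


def comparison_table(results):
    """Arrange selected results as {dataset: {system: value}}."""
    table = {}
    for r in results:
        table.setdefault(r.dataset, {})[r.system] = r.value
    return table


# Placeholder values only, not real benchmark data.
archive = [
    Result("run-A", "sys1", "seq_write", "MB/s", 100.0),
    Result("run-A", "sys2", "seq_write", "MB/s", 150.0),
    Result("run-B", "sys1", "seq_write", "MB/s", 110.0),
    Result("run-B", "sys1", "rand_read", "IOPS", 5000.0),
]

picked = select(archive, systems={"sys1"}, benchmarks={"seq_write"},
                metric="MB/s")
table = comparison_table(picked)
```

A web front end would presumably sit on top of something like this, turning
the table into graphs; that part is left out here.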

We built something like this at my previous employer, and it was invaluable 
(I'm sure similar things must have been done elsewhere).  It made it very
easy to quickly review a new incoming data set and compare it with older
data, to look for progressive changes over time, or to examine behavioral
differences across system configurations.  When you collect enough benchmark
data over time, fully exploiting all that information leaves you with a 
significant data mining problem.

It's helpful if benchmark and workload design supports analysis as well as
measurement.  I tend to like a layered approach where, for example, a base
layer might consist of a set of block layer microbenchmarks that help
characterize storage system performance.  A second layer would consist of
simple file system microbenchmarks - the usual sequential/random read/write
plus selected metadata operations, etc.  More elaborate workloads
representative of important use cases would sit on top of that.  Ideally,
it should be possible to relate changes in higher level workloads/benchmarks
to those below and vice versa.  For example, the block layer microbenchmarks
ought to help determine the maximum performance bounds for the file system
microbenchmarks, etc.  (fio ought to be suitable for the two lower levels in
this scheme;  more elaborate workloads might require some scripting around
fio or some new code.)
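
One way to relate the layers might be to express each file system
microbenchmark result as a fraction of the block layer result that is
supposed to bound it.  The Python sketch below assumes that results have
already been reduced to one throughput number per benchmark; the function
names and the mapping between benchmarks are hypothetical.

```python
def efficiency(fs_throughput, block_throughput):
    """Fraction of the block layer bound achieved by the file system."""
    if block_throughput <= 0:
        raise ValueError("block layer throughput must be positive")
    return fs_throughput / block_throughput


def check_layers(block_results, fs_results, bounded_by):
    """Compare each file system benchmark against its block layer bound.

    bounded_by maps a file system benchmark name to the name of the block
    layer benchmark assumed to bound it.  Returns
    {fs_name: (ratio, exceeds_bound)}.
    """
    report = {}
    for fs_name, block_name in bounded_by.items():
        ratio = efficiency(fs_results[fs_name], block_results[block_name])
        report[fs_name] = (ratio, ratio > 1.0)
    return report
```

A ratio above 1.0 would suggest that the lower layer measurement is not
really acting as an upper bound (caching effects are one possible cause),
and would be worth a closer look before trusting either number.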

When working with benchmarks on a test system that can yield significant
variation, I like to take multiple sets of results and compare them.  This
could certainly be handled statistically;  my usual practice is to do this
manually so as to get a better feel for how the benchmark and the hardware
run together.  Ideally, more experience with the test configuration leads
to hardware reconfiguration or kernel tweaks that can yield more consistent
results (common on NUMA systems, for example).  Strong variation is sometimes
an indication of a problem somewhere (in my experience, at least), so trying
to understand and reduce the variation sometimes leads to a useful fix.
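
For the statistical route, a minimal Python sketch of the check Ted
describes (average unchanged, standard deviation up) might look like the
following.  The thresholds mean_tol and stdev_factor are arbitrary
assumptions chosen for illustration, and the function names are
hypothetical.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for repeated runs."""
    # statistics.stdev needs at least two samples.
    return statistics.mean(samples), statistics.stdev(samples)


def variability_flag(baseline, current, mean_tol=0.05, stdev_factor=2.0):
    """True if the mean is roughly unchanged but the spread has grown.

    mean_tol:     allowed relative change in the mean (assumed 5%).
    stdev_factor: how much larger the standard deviation must be
                  before it is flagged (assumed 2x).
    """
    b_mean, b_sd = summarize(baseline)
    c_mean, c_sd = summarize(current)
    mean_stable = abs(c_mean - b_mean) <= mean_tol * abs(b_mean)
    noisier = c_sd > stdev_factor * b_sd
    return mean_stable and noisier
```

A flag from something like this would only be a prompt to go and look at
the runs by hand, not a replacement for doing so.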

FWIW, I used the Autotest client code for my ext4 work to run the benchmarks
and collect the data, system configuration particulars, run logs, etc.  Most
of what I had to do involved scripting test scenarios that would run selected
sets of benchmarks in predefined test environments (mkfs and mount options,
etc.).  Hooks to run code before and after tests in the standard test
framework made it easy to add lockstat and other instrumentation.
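
To show the general shape of the before/after hook pattern, here is a
small Python sketch.  It is not Autotest's API; run_with_hooks() and the
hook callables are hypothetical names used only to illustrate the idea.

```python
def run_with_hooks(test, pre_hooks=(), post_hooks=()):
    """Run test() with hooks called before and after it.

    Post hooks run even if the test raises, so instrumentation that was
    switched on by a pre hook still gets collected and switched off.
    """
    for hook in pre_hooks:
        hook()
    try:
        return test()
    finally:
        for hook in post_hooks:
            hook()


# Placeholder hooks: real ones might start and save instrumentation.
events = []
result = run_with_hooks(
    lambda: events.append("benchmark") or "done",
    pre_hooks=[lambda: events.append("start instrumentation")],
    post_hooks=[lambda: events.append("save instrumentation")],
)
```

A pre hook might, for instance, enable lock statistics and a post hook
save them alongside the benchmark output; the hook bodies above are only
stand-ins for that.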

It worked well enough, though Autotest contained a number of test environment
assumptions that conflicted with what I wanted to do from time to time, and
required custom workarounds to its framework.  A number of new versions have
been released since then, and a quick look suggests that there have been some
substantial changes (it contains some test scenarios for fio, ffsb, xfstests).
Using Autotest means working in Python, though, and some prefer a simpler
approach using shell scripts.

Autotest's server code can be used to control client code on test systems,
scheduling and operating tests, archiving results in databases, and
postprocessing data.  That was more complexity than I wanted, so I simply
archived my results in my own filesystem directory structure.
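
A plain directory archive of that sort can be quite small.  The Python
sketch below assumes a root/system/benchmark/run_id layout with JSON files
for metrics and configuration; that layout and the function names are
hypothetical, chosen for the example rather than taken from the setup
described above.

```python
import json
from pathlib import Path


def archive_result(root, system, benchmark, run_id, metrics, config):
    """Store one run under root/system/benchmark/run_id/."""
    run_dir = Path(root) / system / benchmark / run_id
    # exist_ok=False: refuse to overwrite an already archived run.
    run_dir.mkdir(parents=True, exist_ok=False)
    (run_dir / "metrics.json").write_text(
        json.dumps(metrics, indent=2, sort_keys=True))
    (run_dir / "config.json").write_text(
        json.dumps(config, indent=2, sort_keys=True))
    return run_dir


def load_metrics(root, system, benchmark):
    """Return {run_id: metrics} for every archived run of a benchmark."""
    base = Path(root) / system / benchmark
    return {
        path.parent.name: json.loads(path.read_text())
        for path in sorted(base.glob("*/metrics.json"))
    }
```

Keeping the configuration next to the metrics for each run means a later
post-processing step can filter on mkfs or mount options without needing a
database.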

Eric
