On 2012-07-28 01:58, Kyle Hailey wrote:
> I've been testing out fio a bit and found it more flexible than the
> other popular I/O benchmark tools such as Iozone and Bonnie++, and fio
> has a more active user community.
>
> In order to easily run fio tests, I've written a wrapper script to go
> through a series of tests.
> In order to understand the output, I've written a wrapper script to
> extract and format the results of multiple tests.
> In order to try to understand the data, I've written some graphing
> routines in R.
>
> The output of the graph routines is visible here:
>
> sites.google.com/site/oraclemonitor/i-o-graphics#TOC-Percentile-Latency
>
> The scripts to run the tests, extract the data, and graph the data in
> R are available here:
>
> github.com/khailey/fio_scripts/blob/master/README.md

Neat stuff!! I'd encourage you to send some of that in so that it could
be included with fio. The graphing scripts that fio ships with are some
that I did fairly quickly, and they aren't super good.

> My main question is: how does one extract key metrics from fio runs,
> and what steps does one take to understand and/or rate the I/O
> subsystem based on the data?

I'm assuming you are using the terse/minimal CSV output format, and
extracting values from that?

> My area of interest is database I/O performance. Databases have
> certain typical I/O access profiles.
> Most notably, databases primarily do random I/O of a set size,
> typically 8K (though this can vary from 2K to 32K).
>
> Looking at thousands of database reports, I typically see random I/O
> around 6-8 ms on solid gear, occasionally faster if someone has some
> serious caching on the SAN, and occasionally slower when the I/O
> subsystem is overtaxed. This fits some numbers I just grabbed from a
> Google search:
>
>   speed   rot_lat   seek     total
>   10K     3 ms      4.3 ms   7.3 ms
>   15K     2 ms      3.8 ms   5.8 ms
>
> For rating random I/O, it seems easy to say something like:
>
>   < 5 ms   awesome
>   < 7 ms   good
>   < 9 ms   pretty good
>   > 9 ms   starting to have contention or slower gear
>
> First, I'm sure these numbers are debatable, but more importantly they
> don't take throughput into account.
> The latency of a single user should be the base latency, and then
> there should be a second value: the throughput that the I/O subsystem
> can sustain while staying within some close factor of that base
> latency.
>
> The above also doesn't take into account wide distributions of latency
> and outliers. For outliers, how important is it that the 99.99th
> percentile is far from the average? How concerning is it that the max
> is multi-second when the average is good?

It all depends on what you are running. For some workloads, it could be
a huge problem, for others not so much. 99.99% is also extreme. At
least for the customers and use cases that I hear about, they are
typically looking at some X latency value at, say, the 99th percentile,
plus some absolute maximum that they can allow.

--
Jens Axboe
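
To make the 8K random-read profile above concrete, here is a minimal
sketch of an fio job file for measuring the base latency Kyle
describes. The device path, job names, and runtime are placeholders,
and libaio assumes Linux; adjust all of them for your environment:

  [global]
  ioengine=libaio     ; assumes Linux with libaio; use psync elsewhere
  direct=1            ; bypass the page cache so the device itself is measured
  rw=randread         ; random reads, the typical database profile
  bs=8k               ; the 8K block size discussed above
  runtime=60
  time_based
  filename=/dev/sdX   ; placeholder -- point this at your own test target

  [db-randread-8k]
  iodepth=1           ; a single outstanding I/O gives the base latency

Rerunning the same job with increasing iodepth (or numjobs) values
then shows how much throughput the subsystem can sustain before
latency drifts away from that single-user base line, which is the
second number Kyle asks for.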
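For extracting key metrics from the terse/minimal output Jens
mentions, a rough shell sketch follows. The field positions assume the
terse v3 format of this era (field 3 is the job name, 8 the read IOPS,
16 the mean read completion latency in usec); verify them against the
fio HOWTO for your version before relying on them:

  # db-randread-8k.fio is the hypothetical job file sketched above
  fio --minimal db-randread-8k.fio | \
    awk -F';' '{printf "job=%s iops=%s clat_mean_us=%s\n", $3, $8, $16}'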
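The seek/rotation table above can also be sanity-checked: average
rotational latency is half a revolution, i.e. 60000 ms per minute
divided by the RPM, divided by two. A quick check in R, which Kyle's
graphing routines already use:

  rpm  <- c(10000, 15000)
  seek <- c(4.3, 3.8)       # average seek times (ms) from the table above
  rot  <- 60000 / rpm / 2   # half a revolution: 3.0 ms and 2.0 ms
  rot + seek                # 7.3 ms and 5.8 ms, matching the table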
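On the outlier question, fio can report the specific percentiles of
interest directly rather than leaving them to post-processing. A
hedged sketch, assuming a fio version with completion-latency
percentile support (check the HOWTO for the exact option names in your
build), added to the job file above:

  clat_percentiles=1                ; record completion latency percentiles
  percentile_list=50:95:99:99.99    ; report exactly the percentiles discussed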