On Wed, 10 Sep 2008, Scott Carey wrote:
> Ok, so this is a drive level parameter that affects the data going into the
> disk cache? Or does it also get pulled over the SATA/SAS link into the OS
> page cache?
It's at the disk block driver level in Linux, so I believe that's all
going into the OS page cache. They've been rewriting that section a bit
and I haven't checked it since that change (see below).
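If you want to see what you're currently getting, here's a rough sketch (Python, purely for illustration) of checking that setting; it assumes the standard sysfs read_ahead_kb file and the blockdev utility, and "sda" is just a placeholder device name:

# Rough sketch: inspect the block-layer read-ahead setting on Linux.
# Assumes /sys/block/<dev>/queue/read_ahead_kb exists and blockdev is
# installed; may need enough privileges to open the device.
import subprocess

DEV = "sda"  # placeholder, adjust for your system

# The kernel block layer reports read-ahead in kB here:
with open("/sys/block/%s/queue/read_ahead_kb" % DEV) as f:
    ra_kb = int(f.read())
print("read_ahead_kb: %d kB" % ra_kb)

# blockdev shows the same value in 512-byte sectors (so kB * 2):
sectors = int(subprocess.check_output(["blockdev", "--getra", "/dev/%s" % DEV]))
print("blockdev --getra: %d sectors" % sectors)

# Raising it is done as root with something like:
#   blockdev --setra 4096 /dev/sda    (4096 sectors = 2MB read-ahead)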
> Additionally, I would like to know how this works with hardware RAID -- Does
> it set this value per disk?
Hardware RAID controllers usually have their own read-ahead policies that
may or may not impact whether the OS-level read-ahead is helpful. Mark's
tests are going straight into the RAID controller, which is why it's
helpful here, and why many people don't ever have to adjust this
parameter. For example, it doesn't give a dramatic gain on my Areca card
even in JBOD mode, because that thing has its own cache to manage with its
own agenda.
Once you start fiddling with RAID stripe sizes as well, the complexity
explodes, and the next thing you know you're busy moving the partition
table around to make the logical sectors line up with the stripes better,
and similar exciting work.
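Just to give a flavor of that alignment exercise, here's a toy calculation; the 64kB stripe size and the old 63-sector partition start are assumptions for illustration only:

# Toy illustration of lining a partition up with a RAID stripe; the 64kB
# stripe size and the classic 63-sector default start are assumed values.
SECTOR_BYTES = 512
STRIPE_KB = 64

stripe_sectors = STRIPE_KB * 1024 // SECTOR_BYTES   # 128 sectors per stripe

start = 63                         # old fdisk default first-partition start
print(start % stripe_sectors)      # 63: misaligned, stripe-sized I/O straddles stripes

# Round the start up to the next stripe boundary:
aligned = ((start + stripe_sectors - 1) // stripe_sectors) * stripe_sectors
print(aligned)                     # 128: a stripe-sized I/O now maps onto one stripe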
> Additionally, the O/S should have a good heuristic-based read-ahead process
> that should make the drive/device level read-ahead much less important. I
> don't know how long it's going to take for Linux to do this right:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> http://kerneltrap.org/node/6642
That was committed in 2.6.23:
http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7
but clearly a larger minimum read-ahead hint still helps, because the
system whose benchmarks we've been staring at already has that feature.
> Some chunk of data in that seek is free, afterwards it is surely not...
You can do a basic model of the drive to get a ballpark estimate on these
things, like the one I threw out, but trying to break down every little bit gets
hairy. In most estimation cases you see, where 128kB is the amount being
read, the actual read time is so small compared to the rest of the numbers
that it just gets ignored.
I was actually being optimistic about how much cache can get filled by
seeks. If the disk is spinning at 15000RPM, that's 4ms to do a full
rotation. That means that on average you'll also wait 2ms just to get the
heads lined up to read that one sector on top of the 4ms seek to get in
the area; now we're at 6ms before you've read anything, topping seeks out
at under 167/second. That number--average seek time plus half a
rotation--is what a lot of people call the IOPS for the drive. Typically
the time spent actually reading data once you've gone through all that
doesn't factor in. IOPS is not very well defined, though; some people
*do* include the reading time once you're there, which is one reason I
don't like to use it. There's a nice chart showing some typical computations here at
http://www.dbasupport.com/oracle/ora10g/disk_IO_02.shtml if anybody wants
to see how this works for other classes of disk. The other reason I don't
like focusing too much on IOPS (some people act like it's the only
measurement that matters) is that it tells you nothing about the
sequential read rate, and you have to consider both at once to get a clear
picture--particularly when there are adjustments that impact those two
oppositely, like read-ahead.
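To make that arithmetic concrete, here's a quick worked version; the 4ms average seek is the assumed figure for a 15K drive used above:

# Worked version of the seek + rotational latency estimate above.
# The 4ms average seek time is an assumed figure for a 15,000 RPM drive.
rpm = 15000
avg_seek_ms = 4.0

rotation_ms = 60.0 * 1000 / rpm          # 4ms for a full rotation
rotational_latency_ms = rotation_ms / 2  # wait half a rotation on average: 2ms

service_ms = avg_seek_ms + rotational_latency_ms   # 6ms before reading anything
iops = 1000.0 / service_ms                          # just under 167 seeks/second

print("%.1f ms per seek -> about %.0f IOPS" % (service_ms, iops))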
As far as the internal transfer speed from the heads to the drive's cache
once it's lined up goes, that's creeping up toward the 200MB/s range for
the kind of faster drives the rest of these stats come from. So the default
of 128kB is going to take 0.6ms, while a full 1MB might take 5ms. You're
absolutely right to question how hard that will degrade seek performance;
these slightly more accurate numbers suggest that might be as bad as going
from 6.6ms to 11ms per seek, or from 150 IOPS to 91 IOPS. It also points
out how outrageously large the really big read-ahead numbers are once
you're seeking instead of sequentially reading.
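Extending the same sketch with the transfer time, again assuming the roughly 200MB/s media rate quoted above:

# Same estimate with transfer time added, assuming the ~200MB/s internal
# rate mentioned above and the 6ms seek + rotation figure from before.
media_mb_per_s = 200.0
seek_plus_rotation_ms = 6.0

def est(readahead_kb):
    transfer_ms = readahead_kb / 1024.0 / media_mb_per_s * 1000.0
    total_ms = seek_plus_rotation_ms + transfer_ms
    return total_ms, 1000.0 / total_ms

for kb in (128, 1024):
    total_ms, iops = est(kb)
    print("%4d kB read-ahead: %.1f ms per random read, ~%.0f IOPS" % (kb, total_ms, iops))
# Comes out around 6.6ms vs 11ms per read, i.e. roughly 150 vs 91 IOPS.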
One thing it's hard to know is how much read-ahead the drive was going to
do on its own, as part of its caching algorithm, no matter what you told
it.
> I suppose I should learn more about pgbench.
Most people use it as just a simple benchmark that includes a mixed
read/update/insert workload. But internally that's done using a little
command substitution "language" that lets you easily write things like
"generate a random number between 1 and 1M, read the record from this
table, and then update this associated record", scaled based on how big
the data set you've given it is. You can write your own scripts in that
form too. And if you specify several scripts like that at a time, it will
switch between them at random, and you can analyze the average execution
time broken down per script if you save the latency logs. That makes it
real easy to adjust the number of clients and the mix of things you have
them do.
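Here's a rough sketch of what a custom script run can look like; the \setrandom syntax is the older pgbench form, and the table name, the 1..100000 ID range, and the "pgbench" database name are placeholders to match however you initialized things with "pgbench -i":

# Rough sketch of the sort of custom script run described above.
# The \setrandom syntax is the older pgbench form; table name, ID range,
# and database name are placeholders (later versions use pgbench_accounts).
import subprocess

script = """\\setrandom aid 1 100000
SELECT abalance FROM accounts WHERE aid = :aid;
UPDATE accounts SET abalance = abalance + 1 WHERE aid = :aid;
"""

with open("read_update.sql", "w") as f:
    f.write(script)

# 8 clients, 1000 transactions each, per-transaction latency logging (-l)
subprocess.call(["pgbench", "-n", "-c", "8", "-t", "1000", "-l",
                 "-f", "read_update.sql", "pgbench"])

Give it more than one -f script and it picks between them at random, which is how you get the sort of mixed workloads described above.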
The main problem: it doesn't scale to large numbers of clients very well.
But it can easily simulate 50-100 clients banging away at a time, which is
usually enough to rank filesystem concurrency capabilities, for example.
It's certainly way easier to throw together a benchmark using it that is
similar to an abstract application than it is to try to model multi-user
database I/O using fio.
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD