Re: Effects of setting linux block device readahead size

On Wed, 10 Sep 2008, Scott Carey wrote:

> Ok, so this is a drive level parameter that affects the data going into the
> disk cache?  Or does it also get pulled over the SATA/SAS link into the OS
> page cache?

It's at the disk block driver level in Linux, so I believe that's all going into the OS page cache. The kernel developers have been rewriting that section a bit and I haven't checked it since that change (see below).
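
If you want to see what a device is currently set to, here's a rough sketch of where the knob lives. It assumes a kernel that exposes the setting through sysfs, and the device names will obviously vary:

    #!/usr/bin/env python
    # Rough sketch: report the current read-ahead setting for each block device.
    # Assumes /sys/block/<dev>/queue/read_ahead_kb exists; this is the same value
    # that "blockdev --getra /dev/<dev>" reports, only there it's expressed in
    # 512-byte sectors rather than kB.
    import os

    for dev in sorted(os.listdir("/sys/block")):
        path = os.path.join("/sys/block", dev, "queue", "read_ahead_kb")
        if not os.path.exists(path):
            continue
        kb = int(open(path).read().strip())
        print("%s: read-ahead %d kB (%d sectors)" % (dev, kb, kb * 2))

Writing a new value to that file, or using blockdev --setra, is the usual way to change it.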

> Additionally, I would like to know how this works with hardware RAID -- Does
> it set this value per disk?

Hardware RAID controllers usually have their own read-ahead policies, which may or may not change whether the OS-level read-ahead is helpful. Mark's tests are going straight into the RAID controller, which is why it's helpful here, and why many people never have to adjust this parameter. For example, it doesn't give a dramatic gain on my Areca card even in JBOD mode, because that card has its own cache to manage, with its own agenda.

Once you start fiddling with RAID stripe sizes as well, the complexity explodes, and the next thing you know you're busy moving the partition table around so the logical sectors line up better with the stripes, and similar exciting work.
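
To give a feel for the arithmetic in that alignment exercise, here's a trivial sketch; the stripe size and partition start are made-up but typical numbers, not anything from the systems being discussed:

    # Hypothetical numbers, just to show the alignment arithmetic.
    sector_bytes = 512            # logical sector size
    partition_start_sector = 63   # classic fdisk default for the first partition
    stripe_kb = 64                # RAID controller stripe size

    offset = partition_start_sector * sector_bytes
    remainder = offset % (stripe_kb * 1024)
    if remainder == 0:
        print("partition start is aligned with the stripe")
    else:
        print("misaligned by %d bytes" % remainder)

With the old 63-sector default start, the partition is misaligned, so a stripe-sized read can end up touching two stripes--exactly the sort of thing that sends you off rearranging partition tables.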

> Additionally, the O/S should have a good heuristic based read-ahead process
> that should make the drive/device level read-ahead much less important.  I
> don't know how long its going to take for Linux to do this right:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> http://kerneltrap.org/node/6642

That was committed in 2.6.23:

http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

but clearly a larger minimum hint still helps, since the system whose benchmarks we've been staring at already has that feature.

> Some chunk of data in that seek is free, afterwards it is surely not...

You can build a basic model of the drive to get a ballpark estimate on these things, like the one I threw out, but trying to break down every little bit gets hairy. In most of the estimates you see, where 128kB is the amount being read, the actual read time is so small compared to the rest of the numbers that it just gets ignored.

I was actually being optimistic about how much cache can get filled by seeks. If the disk is spinning at 15,000 RPM, that's 4ms for a full rotation. On average you'll also wait 2ms for the right sector to swing under the heads, on top of the 4ms seek to get into the area; now we're at 6ms before you've read anything, topping seeks out at under 167/second. That number--average seek time plus half a rotation--is what a lot of people call the IOPS for the drive. The time spent actually reading data once you've gone through all that typically doesn't factor in. IOPS is not very well defined, though; some people *do* include the read time once you're there, which is one reason I don't like to use it.

There's a nice chart showing some typical computations at http://www.dbasupport.com/oracle/ora10g/disk_IO_02.shtml if anybody wants to see how this works out for other classes of disk. The other reason I don't like focusing too much on IOPS (some people act like it's the only measurement that matters) is that it tells you nothing about the sequential read rate, and you have to consider both at once to get a clear picture--particularly when there are adjustments like read-ahead that push the two in opposite directions.

As for the internal transfer speed from the heads to the drive's cache once everything is lined up, that's creeping up toward the 200MB/s range for the kind of faster drives the rest of these stats come from. So the default 128kB of read-ahead is going to take about 0.6ms, while a full 1MB might take 5ms. You're absolutely right to question how badly that will degrade seek performance: these slightly more accurate numbers suggest it might be as bad as going from 6.6ms to 11ms per seek, or from about 150 IOPS to 91 IOPS. It also points out how outrageously large the really big read-ahead numbers are once you're seeking instead of reading sequentially.
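
To spell out the arithmetic behind those numbers (the 4ms seek and 200MB/s transfer rate are the assumptions above, not measurements from any particular drive):

    # Ballpark model of a 15,000 RPM drive using the numbers above.
    rpm = 15000
    rotation_ms = 60.0 * 1000 / rpm          # 4.0 ms per full rotation
    rotational_latency_ms = rotation_ms / 2  # 2.0 ms average wait for the sector
    seek_ms = 4.0                            # assumed average seek time
    access_ms = seek_ms + rotational_latency_ms   # 6.0 ms before any data moves
    print("seeks/sec ignoring transfer: %.0f" % (1000 / access_ms))   # ~167

    transfer_mb_per_s = 200.0                # assumed head-to-cache transfer rate
    for read_kb in (128, 1024):
        transfer_ms = read_kb / 1024.0 / transfer_mb_per_s * 1000
        total_ms = access_ms + transfer_ms
        print("%4d kB read-ahead: %.1f ms per seek, %.0f IOPS"
              % (read_kb, total_ms, 1000 / total_ms))
    # 128 kB: 6.6 ms and ~151 IOPS; 1024 kB: 11.0 ms and ~91 IOPS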

One thing that's hard to know is how much read-ahead the drive was going to do on its own as part of its caching algorithm anyway, no matter what you told it.

> I suppose I should learn more about pgbench.

Most people use it as just a simple benchmark that includes a mixed read/update/insert workload. But internally that's done using a little command substitution "language" that lets you easily write things like "generate a random number between 1 and 1M, read the record from this table, and then update this associated record", and those scale based on how big the data set you've given it is. You can write your own scripts in that form too. If you specify several scripts at a time, it will switch between them at random, and you can analyze the average execution time broken down per type if you save the latency logs. That makes it really easy to adjust the number of clients and the mix of things you have them do.
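
For a taste of what those scripts look like, here's a sketch of a custom pgbench file in the style of the ones it ships with. The table and column names match pgbench's own generated schema from that era, and the exact meta-command syntax has shifted around between releases, so treat this as illustrative rather than something to paste in blindly:

    \set naccounts 100000 * :scale
    \setrandom aid 1 :naccounts
    \setrandom delta -5000 5000
    BEGIN;
    UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
    SELECT abalance FROM accounts WHERE aid = :aid;
    END;

You'd then point pgbench at it with something like "pgbench -c 50 -t 1000 -f myscript.sql mydb", adjusting -c up and down to vary the client count.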

The main problem: it doesn't scale to large numbers of clients very well. But it can easily simulate 50-100 clients banging away at a time, which is usually enough to rank filesystem concurrency capabilities, for example. It's certainly way easier to throw together a benchmark with it that resembles an abstract application than it is to try to model multi-user database I/O using fio.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

