Great info Greg,
Some follow-up questions and information in-line:
On Wed, Sep 10, 2008 at 12:44 PM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
On Wed, 10 Sep 2008, Scott Carey wrote:

How does that readahead tunable affect random reads or mixed random / sequential situations?

It still helps as long as you don't make the parameter giant. The read cache in a typical hard drive nowadays is 8-32MB. If you're seeking a lot, you still might as well read the next 1MB or so after the block requested once you've gone to the trouble of moving the disk somewhere. Seek-bound workloads will only waste a relatively small amount of the disk's read cache that way--the slow seek rate itself keeps that from polluting the buffer cache too fast with those reads--while sequential ones benefit enormously.
If you look at Mark's tests, you can see approximately where the readahead is filling the disk's internal buffers, because what happens then is the sequential read performance improvement levels off. That looks near 8MB for the array he's tested, but I'd like to see a single disk to better feel that out. Basically, once you know that, you back off from there as much as you can without killing sequential performance completely and that point should still support a mixed workload.
Disks are fairly well understood physical components, and if you think in those terms you can build a gross model easily enough:
Average seek time: 4ms
Seeks/second: 250
Data read/seek: 1MB (read-ahead number goes here)
Total read bandwidth: 250MB/s
That's around what a typical interface can support, which is why I suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's pretty close to optimal for sequential as well here (a big improvement over the default Linux RA of 256 blocks=128K). If you know your work is biased heavily toward sequential scans, you might pick the 8MB read-ahead instead. That value (--setra=16384 -> 8MB) has actually been the standard "start here" setting 3ware has suggested on Linux for a while now: http://www.3ware.com/kb/Article.aspx?id=11050
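For reference, blockdev --setra takes its value in 512-byte sectors, so the figures above work out as follows (a quick Python sketch of the arithmetic, nothing more):

def setra_to_kb(setra_sectors: int) -> int:
    # --setra is specified in 512-byte sectors, so KB = sectors * 512 / 1024
    return setra_sectors * 512 // 1024

for setra in (256, 2048, 16384):
    kb = setra_to_kb(setra)
    print(f"--setra {setra:5d} -> {kb:6d} KB ({kb / 1024:.1f} MB)")

# Expected: 256 -> 128 KB (the Linux default), 2048 -> 1 MB,
# 16384 -> 8 MB (the 3ware "start here" value).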
Ok, so this is a drive level parameter that affects the data going into the disk cache? Or does it also get pulled over the SATA/SAS link into the OS page cache? I've been searching around with google for the answer and can't seem to find it.
Additionally, I would like to know how this works with hardware RAID -- does it set this value per disk? Does it set it at the array level (so that 1MB with an 8 disk stripe is actually 128K per disk)? Is it RAID driver dependent? If it is purely the OS, then it sits above the RAID level and affects the whole array -- and is hence almost useless. If it applies to the whole array, it would have a horrendous negative impact on random I/O per second once the total readahead became longer than a stripe width -- if it is a full stripe, then each I/O, even one smaller than a stripe, would cause an I/O on every drive, dropping the I/O per second to that of a single drive.
If it is a drive level setting, then it won't affect I/O per second by making I/Os span multiple drives in a RAID, which is good.
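To put hypothetical numbers on that stripe-width concern (these are made-up figures -- 8 data disks, a 128KB stripe unit, 250 seeks/sec per disk -- not measurements from Mark's array), a rough Python sketch:

# Back-of-the-envelope: if an array-wide readahead spans the full stripe,
# every random read touches every drive, so the array degrades to roughly
# the random I/O rate of a single disk. All numbers here are hypothetical.

disks = 8
stripe_unit_kb = 128                        # per-disk chunk size
stripe_width_kb = disks * stripe_unit_kb    # 1024 KB full stripe
seeks_per_sec_per_disk = 250

def array_random_iops(readahead_kb: int) -> float:
    """Rough random-read IOPS for the array when each request plus its
    readahead occupies however many stripe units it covers."""
    drives_touched = min(disks, max(1, -(-readahead_kb // stripe_unit_kb)))  # ceiling division
    # Small independent reads spread across drives; reads spanning the
    # stripe tie up that many drives per request.
    return disks * seeks_per_sec_per_disk / drives_touched

for ra in (0, 128, 512, 1024):
    print(f"readahead {ra:5d} KB -> ~{array_random_iops(ra):6.0f} random reads/sec")

# With a 1MB (full-stripe) array-level readahead, the estimate collapses
# to ~250/sec, i.e. the rate of a single drive -- the concern above.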
Additionally, the OS should have a good heuristic-based read-ahead process that should make the drive/device level read-ahead much less important. I don't know how long it's going to take for Linux to do this right:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
http://kerneltrap.org/node/6642
Let's expand a bit on your model above for a single disk:
A single disk, with 4ms seeks, and max disk throughput of 125MB/sec. The interface can transfer 300MB/sec.
That gives 250 seeks/sec. Some chunk of data read during each seek is essentially free; after that it surely is not.
At 125MB/sec, about 512KB can be read in 4ms. A 1MB read-ahead would then mean:
4ms seek + 8ms read = 12ms per random read, or ~83 seeks/sec.
However, some chunk of that 1MB is "free" with the seek. I'm not sure how much per drive, but it is likely on the order of 8K - 64K.
I suppose I'll have to experiment in order to find out. But I can't see how a 1MB read-ahead, which should take twice as long as the seek itself to read off the platters, could fail to have a significant impact on random I/O per second on single drives. For SATA drives the transfer rate to seek time ratio is smaller, and their caches are bigger, so a larger read-ahead will impact things less.
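To make that single-disk arithmetic concrete, here is the same model as a small Python sketch; the size of the "free" chunk is just a guess within the 8K-64K range above:

# Single-disk model from the numbers above: 4ms average seek, 125MB/s off
# the platters. The "free" bytes per seek are a guess (somewhere in the
# 8KB-64KB range mentioned); everything past that costs transfer time.

SEEK_S = 0.004            # 4 ms average seek
PLATTER_MB_S = 125.0      # sustained media transfer rate
FREE_KB = 32              # guessed "free" chunk that comes along with the seek

def random_reads_per_sec(readahead_kb: float) -> float:
    """Random reads/sec when every request drags in readahead_kb of data."""
    billable_kb = max(0.0, readahead_kb - FREE_KB)
    transfer_s = (billable_kb / 1024.0) / PLATTER_MB_S
    return 1.0 / (SEEK_S + transfer_s)

for ra in (8, 64, 128, 512, 1024, 8192):
    print(f"{ra:5d} KB readahead -> ~{random_reads_per_sec(ra):6.1f} random reads/sec")

# With no "free" chunk at all, a 1MB readahead works out to 4ms + 8ms = 12ms
# per read, i.e. ~83/sec versus 250/sec for small reads -- the hit described above.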
Trying to make disk benchmarks really complicated is a path that leads to a lot of wasted time. I once made a gigantic design plan for a disk benchmarking tool that worked like the PostgreSQL buffer management system. I threw it away after confirming I could do better with carefully scripted pgbench tests.
I would be very interested in a mixed fio profile with a "background writer" doing moderate, paced random and sequential writes combined with concurrent sequential reads and random reads.
If you want to benchmark something that looks like a database workload, benchmark a database workload. That will always be better than guessing what such a workload acts like in a synthetic fashion. The "seeks/second" number bonnie++ spits out is good enough for most purposes at figuring out if you've detuned seeks badly.
"pgbench -S" run against a giant database gives results that look a lot like seeks/second, and if you mix multiple custom -f tests together it will round-robin between them at random...
I suppose I should learn more about pgbench. Most of this depends on how much time it takes to do one versus the other. In my case, setting up the DB will take significantly longer than writing 1 or 2 more fio profiles. I categorize mixed load tests as basic tests -- you don't want to uncover configuration issues after the application test that a simple mix of read/write and sequential/random could have exposed up front. This is similar to increasing the concurrency. Some file systems deal with concurrency much better than others.
It's really helpful to measure these various disk subsystem parameters individually. Knowing the sequential read/write, seeks/second, and commit rate for a disk setup is mainly valuable at making sure you're getting the full performance expected from what you've got. Like in this example, where something was obviously off on the single disk results because reads were significantly slower than writes. That's not supposed to happen, so you know something basic is wrong before you even get into RAID and such. Beyond confirming whether or not you're getting approximately what you should be out of the basic hardware, disk benchmarks are much less useful than application ones.
Absolutely -- it's critical to run the synthetic tests, and the random read/write and sequential read/write cases are essential. These should be tuned and understood before going on to more complicated things.
However, once you actually go and set up a database test, there are tons of questions -- what type of database? What type of query load? What type of mix? How big? In my case, the answer is: our database, our queries, and big. That takes a lot of setup effort, and redoing it for each new file system will take a long time in my case -- pg_restore takes a day+. Therefore, I'd like to know ahead of time which file system + configuration combinations are a waste of time because they don't perform under concurrency with a mixed workload. That's my admittedly greedy need for the extra test results.
With all that, I think I just gave away what the next conference paper I've been working on is about.
Looking forward to it!