Re: Effects of setting linux block device readahead size

Hmm, I would expect this tunable to potentially be rather file system dependent, and potentially RAID controller dependent.  The test was using ext2; perhaps the other file systems automatically prefetch or read ahead on their own?  Does it vary by RAID controller?

Well, I went and found out, using ext3 and xfs.  I have 120+ data points; here are a few interesting ones before I compile the rest and answer a few other questions of my own.

1:  readahead does not affect "pure" random I/O -- there seems to be a heuristic trigger: a single process or file probably has to issue a run of sequential I/O of some size before readahead kicks in.  To confirm this I set readahead to over 64MB and random iops remained the same (a fio job sketch for checking this is just after this list).
2:  The file system matters more than you would expect.  With readahead tuned, XFS had TWICE the sequential read throughput of ext3, both for a single reader and for 8 concurrent readers on 8 different files.
3:  The RAID controller and its configuration make a pretty significant difference as well.
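
The job I have in mind for checking point 1 looks like this (the size and directory are examples, mirroring the sequential jobs further down):

[rand-read8]
; pure random read profile, 8 processes on 8 separate files --
; throughput should not move when the readahead setting changes
; (size and directory are examples matching the sequential jobs below)
rw=randread
size=8g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=8
nrfiles=1
runtime=1m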

Hardware:
12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3, xfs)
I also have some results for a PERC raid 10 with 4x 15K SAS drives, but I'm not reporting those in this message.


Testing process:
All tests begin with
#sync; echo 3 > /proc/sys/vm/drop_caches;
followed by
#blockdev --setra XXX /dev/sdb
Even though fio claims that it issues reads that don't go to cache, the readahead itself DOES go to the file system cache, so you must drop the caches to get consistent results unless you disable readahead entirely.  Even if you are reading more than 2x the physical RAM, the first half of the test is distorted otherwise.  By flushing the cache first, my results became consistent to within about +/-2%.
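
The whole sweep is easy to script; a minimal sketch of what I run looks roughly like this (the device name, job file name, and readahead values are examples):

#!/bin/bash
# Sweep readahead values (in 512-byte sectors) and rerun the same fio job,
# flushing the page cache before each run for consistent results.
DEV=/dev/sdb          # block device under test (example)
JOB=seq-read8.fio     # fio job file, e.g. the [seq-read8] job below (example name)

for ra in 256 1024 4096 12288 49152; do
    sync
    echo 3 > /proc/sys/vm/drop_caches   # drop page/dentry/inode caches
    blockdev --setra "$ra" "$DEV"       # set the device readahead
    echo "=== readahead $ra sectors ==="
    fio "$JOB"
done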

Tests
-- fio, read 8 files concurrently, sequential read profile, one process per file:
[seq-read8]
rw=read
; this will be total of all individual files per process
size=8g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=8
; this is number of files total per process
nrfiles=1
runtime=1m

-- fio, read one large file sequentially with one process
[seq-read]
rw=read
; this will be total of all individual files per process
size=64g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=1
; this is number of files total per process
nrfiles=1
runtime=1m

-- 'dd' in a few ways:
Measure direct to partition / disk read rate at the start of the disk:
'dd if=/dev/sdb of=/dev/null ibs=24M obs=64K'
Measure direct to partition / disk read rate near the end of the disk:
'dd if=/dev/sdb1 of=/dev/null ibs=24M obs=64K skip=160K'
Measure direct read of the large file used by the FIO one sequential file test:
'dd if=/data/test/seq-read.1.0 of=/dev/null ibs=32K obs=32K'

The dd block-size parameters were chosen after much experimentation to get the best result.
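
If you want to map out the outer-to-inner track falloff in more detail, a quick loop like this works (the device, offsets, and read size are examples; offsets are in units of the 24M input block size):

#!/bin/bash
# Sample the raw sequential read rate at several offsets across the device
# to see how throughput drops toward the inner tracks.
DEV=/dev/sdb        # raw device (example)
for skip in 0 40K 80K 120K 160K; do     # offsets in 24M input blocks (examples)
    sync
    echo 3 > /proc/sys/vm/drop_caches   # raw device reads are cached too
    echo "=== offset: $skip x 24M ==="
    dd if="$DEV" of=/dev/null ibs=24M obs=64K skip="$skip" count=2048   # ~48GB read per sample
done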


Results:
I've got a lot of results; I'm only going to put a few of them here for now while I investigate a few other things (see the end of this message).
Preliminary summary:

PERC 6, ext3, full partition.
dd beginning of disk :  642MB/sec
dd end of disk: 432MB/sec
dd large file (readahead 49152): 312MB/sec
-- maximum expected sequential capabilities above?

fio: 8 concurrent readers and 1 concurrent reader results
Readahead is in 512-byte sectors; sequential transfer rate is in MiB/sec as reported by fio.

readahead  |  8 conc read rate  |  1 conc read rate
    49152  |  311               |  314
    16384  |  312               |  312
    12288  |  304               |  309
     8192  |  292               |
     4096  |  264               |
     2048  |  211               |
     1024  |  162               |  302
      512  |  108               |
      256  |   81               |  300
        8  |   38               |

Conclusion: on this array, going up to a 12288-sector (6MB) readahead makes a huge impact on concurrent sequential reads -- that works out to about 1MB per mirror pair (6 pairs in this 12-disk raid 10).  It has almost no impact at all on a single sequential read; the OS or the RAID controller already handles that case just fine.
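
(For reference on the units: the readahead value is in 512-byte sectors, so the 6MB and 1MB-per-pair figures above come straight from:)

# readahead is in 512-byte sectors:
echo $((12288 * 512 / 1024 / 1024))       # = 6 MB total readahead
echo $((12288 * 512 / 1024 / 1024 / 6))   # = 1 MB per mirror pair (12-disk raid 10)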

But, how much of the above effect is ext3?  How much is it the RAID card?  At the top end, the sequential rate for both concurrent and single sequential access is in line with what dd can get going through ext3.  But it is not even close to what you can get going right to the device and bypassing the file system.

Let's try a different RAID card first.  The disks aren't exactly the same, and there is no guarantee that the files are positioned near the beginning or end of the array, but I've got another 12-disk RAID 10 on a 3Ware 9650 card.

Results, as above -- don't conclude that this card is faster; the files may simply have been closer to the front of the partition.
dd, beginning of disk: 522MB/sec
dd, end of disk array: 412MB/sec
dd, file read via file system (readahead 49152): 391MB/sec

readahead  |  8 conc read rate  |  1 conc read rate
    49152  |  343               |  392
    16384  |  349               |  379
    12288  |  348               |  387
     8192  |  344               |
     6144  |                    |  376
     4096  |  340               |
     2048  |  319               |
     1024  |  284               |  371
      512  |  239               |  376
      256  |  204               |  377
      128  |  169               |  386
        8  |   47               |  382

Conclusion: this RAID controller definitely behaves differently -- it is much less sensitive to readahead.  Perhaps it has a larger stripe size?  This one is most likely set up with a 256K stripe; I don't know the stripe size on the other array, though the PERC 6 default of 64K is a reasonable guess.
 

Ok, so the next question is how file systems play into this.
First, I ran a bunch of tests with xfs, and the results were rather odd.  That is when I realized that the platter speed at the start of the array is significantly different from the speed at the end, and that xfs and ext3 make different decisions about where to place files on an empty partition (xfs spreads them out evenly; ext3 puts them closer together, but the actual position is still somewhat random).

So, I created a partition roughly 10% the size of the whole array, at the beginning of the array (a sketch of the sort of commands involved follows).
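
Roughly like this, though the device and sizes are examples rather than my exact commands:

# Carve out a partition covering roughly the first 10% of the array so the
# test files stay on the fast outer tracks (device and sizes are examples).
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 0% 10%
mkfs.ext3 /dev/sdb1        # or mkfs.xfs for the xfs runs
mount /dev/sdb1 /data      # mount point is an example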

Using the PERC 6 setup, this leads to:
dd, against partition: 660MB/sec max result, 450MB/sec min -- not a reliable test for some reason
dd, against file on the partition (ext3): 359MB/sec

ext3 (default settings):
readahead  |  8 conc read rate  |  1 conc read rate
    49152  |  363               |
    12288  |  359               |
     6144  |  319               |
     1024  |  176               |
      256  |                    |
 
Analysis:  I only have 8 concurrent read results here, as these are the most interesting based on the results from the whole disk tests above.  I also did not collect a lot of data points.
What is clear is that the partition at the front does make a difference: compared to the whole-partition results we get about 15% more throughput on the 8 concurrent read test, which suggests that in the whole-disk case ext3 placed the files somewhere near the middle of the drive geometry.
The 8 concurrent read test has the same "break point" at about a 6MB readahead, which is also consistent.

And now, for XFS, a full result set and VERY surprising results.  I dare say, the benchmarks that led me to do these tests are not complete without XFS tests:

xfs (default settings):
readahead  |  8 conc read rate  |  1 conc read rate
    98304  |  651               |  640
    65536  |  636               |  609
    49152  |  621               |  595
    32768  |  602               |  565
    24576  |  595               |  548
    16384  |  560               |  518
    12288  |  505               |  480
     8192  |  437               |  394
     6144  |  412               |  415 *
     4096  |  357               |  281 *
     3072  |  329               |  338
     2048  |  259               |  383
     1536  |  230               |  445
     1280  |  207               |  542
     1024  |  182               |  605 *
      896  |  167               |  523
      768  |  148               |  456
      512  |  119               |  354
      256  |   88               |  303
       64  |   60               |  171
        8  |   36               |   55

* These local maxima and minima for the single sequential transfer were re-tested several times to validate them.  They may have something to do with the fact that I did not tune the inode layout for the array using the xfs stripe unit and stripe width (sunit/swidth) parameters.

dd, on the file used in the single reader sequential read test:
660MB/sec.   One other result for the sequential transfer, using a gigantic 393216 (192MB) readahead:
672 MB/sec.

Analysis:
XFS gets significantly higher sequential read transfer rates than ext3.  It also had higher write results, but I've only run one write test so far.
Both ext3 and xfs can be tuned a bit further, mainly with noatime and with parameters that tell them about the geometry of the RAID array; a sketch of what that looks like follows.
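
For xfs, something like this; the 64K stripe size and 6 data-bearing spindles are assumptions for a 12-disk raid 10 with a 64K controller stripe, so substitute the real geometry:

# Tell xfs about the RAID geometry at mkfs time:
#   su = controller stripe size, sw = number of data-bearing spindles
#   (64k and 6 are assumptions for a 12-disk raid 10 with a 64K stripe)
mkfs.xfs -f -d su=64k,sw=6 /dev/sdb1

# Mount with noatime so reads don't trigger access-time writes
mount -o noatime /dev/sdb1 /data

# The rough ext3 equivalent is the stride option (stripe size in fs blocks,
# e.g. 64K stripe / 4K block = 16):
#   mkfs.ext3 -E stride=16 /dev/sdb1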


Other misc results:
 I used the deadline scheduler; it didn't change the results here.
 I ran some tests to "feel out" the sequential transfer rate's sensitivity to readahead on a 4x 15K RPM SAS raid setup -- it is less sensitive:
  ext3, 8 concurrent reads -- readahead = 64, 120MB/sec;  readahead = 256, 195MB/sec;  readahead = 3072, 200MB/sec;  readahead = 32768, 210MB/sec
On the 3Ware setup, with ext3, postgres was installed and a 'select count(1)' against tables larger than 5GB read between 300 and 320 MB/sec at about 88% disk utilization; dd can get 390MB/sec with the same settings (readahead 12288).
With readahead set back to the default, postgres gets about 220MB/sec at 100% disk utilization on similar tables.  I will be testing xfs on this same data eventually and expect significant gains there.
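
One practical note: readahead set with blockdev does not survive a reboot, so whatever value wins needs to be reapplied at boot, e.g. from rc.local (the device and value below are just examples):

# reapply the chosen readahead at boot, for example from /etc/rc.local
# (device and value are examples)
blockdev --setra 12288 /dev/sdb

# verify the current setting (reported in 512-byte sectors)
blockdev --getra /dev/sdb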

Remaining questions:
Readahead does NOT activate for pure random requests, which is a good thing.  The question is: when does it activate?  I'll have to write some custom fio tests to find out.  I suspect it kicks in either after the OS detects some number X of sequential requests on the same file (or from the same process), or after sequential access of at least Y bytes.  I'll report results once I know, and use them to construct some worst-case scenarios for a large readahead.
I will also measure its effect when random access and streaming reads are mixed; a first sketch of such a job file follows.
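
Something along these lines, where the sizes, directory, and job counts are placeholders rather than a finished test:

; mixed-workload sketch: one streaming reader plus several random readers on
; separate files, to see how a large readahead affects each
; (sizes, directory, and job counts are examples)
[global]
directory=/data/test
blocksize=8k
direct=0
ioengine=sync
iodepth=1
runtime=1m

[stream-read]
rw=read
size=32g
numjobs=1
nrfiles=1

[rand-read]
rw=randread
size=8g
numjobs=4
nrfiles=1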


On Wed, Sep 10, 2008 at 7:49 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
On Tue, 9 Sep 2008, Mark Wong wrote:

I've started to display the effects of changing the Linux block device
readahead buffer to the sequential read performance using fio.

Ah ha, told you that was your missing tunable.  I'd really like to see the whole table of one disk numbers re-run when you get a chance.  The reversed ratio there on ext2 (59MB read/92MB write) was what tipped me off that something wasn't quite right initially, and until that's fixed it's hard to analyze the rest.

Based on your initial data, I'd say that the two useful read-ahead settings for this system are 1024KB (conservative but a big improvement) and 8192KB (point of diminishing returns).  The one-disk table you've got (labeled with what the default read-ahead is) and new tables at those two values would really flesh out what each disk is capable of.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD



