Re: Hardware/OS recommendations for large databases

Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling shows that about 64% of the I/O subsystem's speed is available to
the executor - as the I/O subsystem's scan rate increases, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
 
================= Results ===================

Unless noted otherwise, all results posted are for block device readahead set
to 16384 sectors (8 MiB) using "blockdev --setra 16384 <block_device>".  All
are using the 2.6.9-11 CentOS 4.1 kernel.
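As a sanity-check sketch of what that setting means (the device name is a
placeholder, not one from the test machine): blockdev's --setra argument
counts 512-byte sectors, so the effective readahead window is:

```shell
# blockdev --setra takes a count of 512-byte sectors; convert to bytes.
RA_SECTORS=16384
RA_BYTES=$(( RA_SECTORS * 512 ))
echo "readahead window: ${RA_BYTES} bytes"   # 16384 sectors = 8 MiB

# Applying and verifying on a real device (requires root):
#   blockdev --setra 16384 /dev/sdX
#   blockdev --getra /dev/sdX
```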

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.
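A quick cross-check of that equivalence, using only the numbers above:
dividing the 6553.6 MB written by each command's sync-inclusive wall time
gives nearly identical rates.

```shell
# dd wrote 800000 blocks * 8 KB = 6553.6 MB; both wall times include
# the cost of flushing dirty pages to disk.
awk 'BEGIN {
  mb = 800000 * 8192 / 1e6
  printf "dd + sync:   %.1f MB/s\n", mb / 33.057
  printf "lmdd sync=1: %.1f MB/s\n", mb / 33.032
}'
```

(lmdd's own 209.4 MB/s figure is higher only because it uses its internal
31.3 s timer rather than the full wall time.)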

I use a file of 2x physical memory with dd for the *READ* performance testing
so the reads can't be served from the page cache, but let's make sure things
are synced on both write and read for this set of comparisons.
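A sketch of sizing the test file that way (the 8 GiB fallback is an
assumption - the 16GB/2x-memory figures above imply an 8GB machine):

```shell
# Size the read-test file to 2x physical RAM so the second read cannot
# be served from the page cache.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_kb=${mem_kb:-8388608}            # assumed 8 GiB if /proc is unavailable
file_mb=$(( mem_kb * 2 / 1024 ))
blocks=$(( file_mb * 1024 / 8 ))     # dd/lmdd count= value for bs=8k
echo "test file: ${file_mb} MB (count=${blocks} at bs=8k)"
```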

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[root@modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance, and perhaps as a
consequence the read improves as well.  There doesn't seem to be a difference
between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll use a table of at least 16GB (2x
memory) to ensure that we aren't reading from the cache - the lineitem table
below comes to 26GB.  I'll write this file through Postgres COPY to be sure
that the file layout is as Postgres creates it.  The alternative would be to
COPY once, then tar/untar onto the different filesystems, but that might not
duplicate the real-world results.

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
----------
  3159138
(1 row)

3159138 * 8192 / 1000000 = 25879 million bytes, or 25.9GB
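The same arithmetic as a sketch (8192 bytes per page is the default Postgres
block size):

```shell
# Table size from pg_class.relpages, at the default 8 KB Postgres page size.
relpages=3159138
bytes=$(( relpages * 8192 ))
echo "lineitem: ${bytes} bytes = $(( bytes / 1000000 )) million bytes"
```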

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages 
----------
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count   
-----------
 119994608
(1 row)

Time: 122822.090 ms
============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                              ext2        xfs
Write Speed                   114         311
Read Speed                    332         407
Postgres Seq Scan Speed       212         263
Scan % of lmdd Read Speed     63.9%       64.6%
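The last row follows directly from the rows above it - the Postgres seq scan
rate divided by the lmdd read rate for the same filesystem:

```shell
# "Scan % of lmdd Read Speed", recomputed from the measured rates (MB/s).
awk 'BEGIN {
  printf "ext2: %.1f%%\n", 212 / 332 * 100
  printf "xfs:  %.1f%%\n", 263 / 407 * 100
}'
```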

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke



