This question was originally posted on http://dba.stackexchange.com/questions/96444/cant-get-dell-pe-t420-perc-h710-perform-better-than-a-macmini-with-postgresql, where it was suggested that I post it to this mailing list. I have been trying to solve a performance issue with PostgreSQL for months. I can provide all the technical details needed.
SYSTEM CONFIGURATION
Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured as follows:
- VD0: two 15k SAS disks (ext4, OS partition, WAL partition, RAID1)
- VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)
This system has the following configuration:
- Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
- 128GB RAM (DDR3, 8x16GB @1600MHz)
- two Intel Xeon E5-2640 v2 @2GHz
- Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "disabled"):
- VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
- VD1 (Postgres data partition): ten 10k SAS disks (XFS, RAID5)
- PostgreSQL 9.4 (updated to the latest available version)
- moved pg_stat_tmp to RAM disk
My personal low-cost, low-profile development machine is a MacMini configured as follows:
- OS X Server 10.7.5
- 8GB RAM (DDR3, 2x4GB @1333MHz)
- one Intel i7 @2.2GHz
- two Internal 500GB 7.2k SAS HDD (non RAID) for OS partition
- external Promise Pegasus R1 connected with Thunderbolt v1 (512MB
RAM, four 1TB 7.2k SAS HDD 32MB cache, RAID5, Write cache: "WriteBack",
Read cache: "ReadAhead", Disk cache: "enabled", NCQ: "enabled")
- PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)
- moved pg_stat_tmp to RAM disk
So far I've made a lot of tuning adjustments to both machines,
including the kernel settings recommended in the official Postgres documentation.
APPLICATION
The deployment machine runs a web platform which instructs Postgres
to run big transactions over billions of records. The platform is
designed for a single user, because system resources have to be dedicated
as much as possible to one job at a time due to the data size (I don't like
to call it big data, because big data is on the order of tens of billions).
ISSUES
I've found the deployment machine to be a lot slower than the
development machine. This is paradoxical, because the two machines
differ greatly in many respects. I've run many queries to investigate this
strange behaviour and have made a lot of tuning adjustments. Over the last two months I've prepared and executed two types of query sets:
- A: these sets make use of SELECT ... INTO, CREATE INDEX, CLUSTER and VACUUM ANALYZE.
- B: these sets come from transactions generated by our application and make use of SELECT over the tables created with set A.
A and B were always slower on the T420. The only type of operation that was faster there is VACUUM ANALYZE.
RESULTS
A type set:
- T420: went from 311 seconds (default postgresql.conf) to 195 seconds after tuning adjustments to RAID, kernel and postgresql.conf;
- MacMini: 40 seconds.
B type set:
- T420: 141 seconds;
- MacMini: 101 seconds.
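To put the timings above side by side, the slowdown factor is simply the T420 time divided by the MacMini time. A minimal sketch with awk; the `slowdown` helper name is only for illustration and is not part of any tool:

```shell
#!/bin/sh
# slowdown T420_SECONDS MACMINI_SECONDS
# Prints how many times slower the T420 run was, to one decimal place.
slowdown() {
    awk -v t420="$1" -v mac="$2" 'BEGIN { printf "%.1f\n", t420 / mac }'
}

echo "A set: T420 is $(slowdown 195 40)x slower"    # 195 s vs 40 s
echo "B set: T420 is $(slowdown 141 101)x slower"   # 141 s vs 101 s
```

With the figures reported here this gives roughly 4.9x for the A set and 1.4x for the B set, so the gap is much wider on the write-heavy A set.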
I should mention that we have also adjusted the BIOS on the T420, setting
all possible parameters to "performance" and disabling low-energy
profiles. This lowered the execution time of a type A set from 240 seconds
to 211 seconds. We have also upgraded all firmware and the BIOS to the latest available versions.
Here are two benchmarks generated using pg_test_fsync:
T420 pg_test_fsync
60 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 23358.758 ops/sec 43 usecs/op
fdatasync 21417.018 ops/sec 47 usecs/op
fsync 21112.662 ops/sec 47 usecs/op
fsync_writethrough n/a
open_sync 23082.764 ops/sec 43 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 11737.746 ops/sec 85 usecs/op
fdatasync 19222.074 ops/sec 52 usecs/op
fsync 18608.405 ops/sec 54 usecs/op
fsync_writethrough n/a
open_sync 11510.074 ops/sec 87 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 21484.546 ops/sec 47 usecs/op
2 * 8kB open_sync writes 11478.119 ops/sec 87 usecs/op
4 * 4kB open_sync writes 5885.149 ops/sec 170 usecs/op
8 * 2kB open_sync writes 3027.676 ops/sec 330 usecs/op
16 * 1kB open_sync writes 1512.922 ops/sec 661 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 17946.690 ops/sec 56 usecs/op
write, close, fsync 17976.202 ops/sec 56 usecs/op
Non-Sync'ed 8kB writes:
write 343202.937 ops/sec 3 usecs/op
MacMini pg_test_fsync
60 seconds per test
Direct I/O is not supported on this platform.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 3780.341 ops/sec 265 usecs/op
fdatasync 3117.094 ops/sec 321 usecs/op
fsync 3156.298 ops/sec 317 usecs/op
fsync_writethrough 110.300 ops/sec 9066 usecs/op
open_sync 3077.932 ops/sec 325 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 1522.400 ops/sec 657 usecs/op
fdatasync 2700.055 ops/sec 370 usecs/op
fsync 2670.652 ops/sec 374 usecs/op
fsync_writethrough 98.462 ops/sec 10156 usecs/op
open_sync 1532.235 ops/sec 653 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 2634.754 ops/sec 380 usecs/op
2 * 8kB open_sync writes 1547.801 ops/sec 646 usecs/op
4 * 4kB open_sync writes 801.542 ops/sec 1248 usecs/op
8 * 2kB open_sync writes 405.515 ops/sec 2466 usecs/op
16 * 1kB open_sync writes 204.095 ops/sec 4900 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 2747.345 ops/sec 364 usecs/op
write, close, fsync 3070.877 ops/sec 326 usecs/op
Non-Sync'ed 8kB writes:
write 3275.716 ops/sec 305 usecs/op
This confirms the hardware I/O capabilities of the T420, but doesn't explain why the MacMini is so much faster.
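For anyone who wants to compare the two pg_test_fsync runs method by method, here is a rough sketch. It assumes each run's output was saved to a text file; the helper names `first_rate` and `rate_ratio` and the temporary sample files are hypothetical, and the sample lines are copied from the two runs above.

```shell
#!/bin/sh
# first_rate METHOD FILE
# Prints the ops/sec of the first line for METHOD in saved pg_test_fsync
# output (the first occurrence is in the "one 8kB write" section).
# Lines such as "fsync_writethrough n/a" are skipped.
first_rate() {
    awk -v m="$1" '$1 == m && $3 == "ops/sec" { print $2; exit }' "$2"
}

# rate_ratio METHOD FILE_A FILE_B
# Prints FILE_A's rate divided by FILE_B's rate, to one decimal place,
# or "n/a" when either file has no numeric rate for METHOD.
rate_ratio() {
    a=$(first_rate "$1" "$2")
    b=$(first_rate "$1" "$3")
    if [ -z "$a" ] || [ -z "$b" ]; then
        echo "n/a"
        return 0
    fi
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", a / b }'
}

# Sample data: two lines copied from each run above.
t420=$(mktemp)
mac=$(mktemp)
printf '%s\n' 'open_datasync 23358.758 ops/sec 43 usecs/op' \
              'fdatasync 21417.018 ops/sec 47 usecs/op' > "$t420"
printf '%s\n' 'open_datasync 3780.341 ops/sec 265 usecs/op' \
              'fdatasync 3117.094 ops/sec 321 usecs/op' > "$mac"

echo "fdatasync: T420 rate is $(rate_ratio fdatasync "$t420" "$mac")x the MacMini rate"
rm -f "$t420" "$mac"
```

On the numbers above, the T420 sustains several times the synced write rate of the MacMini, which is the opposite of what the query timings show.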
Now for some query profiling times. B type sets are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two queries from a single transaction and executed both on each system. Here are the results:
T420
- Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM
- Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06
MacMini
- Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx
- Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk
COMPILING PGSQL
I compiled and tested all the latest pgsql versions (9.0.19, 9.1.15, 9.2.10, 9.3.6 and 9.4.1) using different combinations of parameters for gcc-4.9.1 (gcc 4.7 for pgsql 9.0.19) and Postgres (I also tried the clang compiler with different optimization flags, with no benefit). I followed this article but I was unable to test the -flto option due to several errors returned by make. After two days of testing I went down from 195 to 189 seconds on the T420,
where the MacMini still takes 40 seconds (A set), and from 141 to 129 seconds, where
the MacMini takes 101 seconds (B set). On the MacMini I used the built-in pgsql 9.0.13, while on the T420 I used the following compile options, which gave the best results:
./configure CFLAGS="-O3 -fno-inline-functions -march=native" --with-openssl --with-libxml --with-libxslt --with-wal-blocksize=64 --with-blocksize=32 --with-wal-segsize=64 --with-segsize=1
I've also tried disabling Hyper-Threading with echo 0 > /sys/devices/system/cpu/cpuN/online, where cpuN
is the N-th logical CPU, but nothing changed for the B set queries. We have
2 CPUs with 8 cores each, for a total of 16 physical cores and 16 logical
cores.
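For reference, a rough sketch of how the logical CPUs to take offline could be picked. It assumes the kernel exposes topology/thread_siblings_list under each cpuN directory in sysfs; the `secondary_threads` helper name is hypothetical. It only prints the CPU numbers; the actual offlining step needs root and is left as a comment.

```shell
#!/bin/sh
# secondary_threads [SYSFS_CPU_DIR]
# Prints the number of every logical CPU that is NOT the first entry in its
# own thread_siblings_list, i.e. the extra hyper-thread of each core.
secondary_threads() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for dir in "$cpu_root"/cpu[0-9]*; do
        list="$dir/topology/thread_siblings_list"
        [ -r "$list" ] || continue
        n=${dir##*cpu}                        # ".../cpu17" -> "17"
        first=$(sed 's/[,-].*//' "$list")     # "1,17" or "1-2" -> "1"
        if [ "$n" != "$first" ]; then
            echo "$n"
        fi
    done
}

# Show which logical CPUs would be taken offline (prints nothing if
# Hyper-Threading is already off or sysfs is not available).
secondary_threads

# As root, the siblings could then be disabled with the same echo as above:
# for n in $(secondary_threads); do
#     echo 0 > "/sys/devices/system/cpu/cpu$n/online"
# done
```

Writing 1 to the same online file brings a CPU back, so the change is easy to revert between test runs.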
It seems like the T420 doesn't push hard on a single transaction, while it is probably able to manage multiple connections much better than the MacMini. I can't figure out why it's so much slower than the MacMini on any kind of query (from data loading to data selection).