On Wed, Jun 2, 2010 at 7:30 PM, Craig James <craig_james@xxxxxxxxxxxxxx> wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate RAID1
> drive, TPS drops from over 1200 to less than 90!  I've checked everything
> and can't find a reason.
>
> Here are the details.
>
> 8 cores (2x4 Intel Nehalem 2 GHz)
> 12 GB memory
> 12 x 7200 SATA 500 GB disks
> 3WARE 9650SE-12ML RAID controller with bbu
> 2 disks: RAID1 500GB ext4 blocksize=4096
> 8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
> 2 disks: hot swap
> Ubuntu 10.04 LTS (Lucid)
>
> With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results
> (this one is for xfs):
>
> Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> argon        24064M 70491  99 288158 25 129918 16 65296  97 428210 23 558.9   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152  54
> argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143,74,+++++,+++,15152,54
>
> pgbench -i -s 100 -U test
> pgbench -c 10 -t 10000 -U test
>    scaling factor: 100
>    query mode: simple
>    number of clients: 10
>    number of transactions per client: 10000
>    number of transactions actually processed: 100000/100000
>    tps = 1046.104635 (including connections establishing)
>    tps = 1046.337276 (excluding connections establishing)
>
> Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE
> controller, two more SATA 7200 disks).  Run the same tests and ...
>
>    tps = 82.325446 (including connections establishing)
>    tps = 82.326874 (excluding connections establishing)
>
> I thought I'd made a mistake, like maybe I moved the whole database to the
> RAID1 array, but I checked and double checked.  I even watched the lights
> blink - the WAL was definitely on the RAID1 and the rest of Postgres on the
> RAID10.
>
> So I moved the WAL back to the RAID10 array, and performance jumped right
> back up to the >1200 TPS range.
>
> Next I check the RAID1 itself:
>
>    dd if=/dev/zero of=./bigfile bs=8192 count=2000000
>
> which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good
> performance too:
>
> Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> argon        24064M 68601  99 110057 18 46534   6 59883  90 123053  7 471.3   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
>
> So ... anyone have any idea at all how TPS drops to below 90 when I move the
> WAL to a separate RAID1 disk?  Does this make any sense at all?  It's
> repeatable.  It happens for both ext4 and xfs.  It's weird.
>
> You can even watch the disk lights and see it: the RAID10 disks are on
> almost constantly when the WAL is on the RAID10, but when you move the WAL
> over to the RAID1, its lights are dim and flicker a lot, like it's barely
> getting any data, and the RAID10 disk's lights barely go on at all.

*) Is your RAID1 unit configured with write-back cache on the controller?
*) Have you tried changing wal_sync_method to fdatasync?
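
A rough way to check the first point from the OS side: dd and bonnie++ measure streaming throughput, but every pgbench commit has to wait for a small synchronous WAL write, and that only goes fast if the controller acknowledges it from cache. Below is a minimal Python sketch that counts how many 8 kB write-plus-sync cycles per second a directory can sustain. The function name sync_rate and the script layout are made up for illustration and are not part of any Postgres tool; it only approximates what the WAL does.

```python
import os
import sys
import tempfile
import time


def sync_rate(directory, sync=os.fsync, writes=200, block_size=8192):
    """Return how many synchronous block writes per second `directory` sustains.

    `sync` is the flush call to test: os.fsync, or os.fdatasync where the
    platform provides it (Linux does, macOS does not).
    """
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)  # scratch file, removed afterwards
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.pwrite(fd, block, 0)  # rewrite the same 8 kB block each time
            sync(fd)                 # wait until the storage stack reports it durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / max(elapsed, 1e-9)  # guard against a zero-length interval


if __name__ == "__main__":
    # Usage (paths are hypothetical): python sync_rate.py /mnt/raid1 /mnt/raid10
    for d in sys.argv[1:]:
        print("%s: fsync %.0f/s, fdatasync %.0f/s"
              % (d, sync_rate(d), sync_rate(d, os.fdatasync)))
```

Run it against a directory on each array. A 7200 rpm disk turns 120 times a second, so if the RAID1 reports on the order of 100 syncs per second while the RAID10 reports far more, that would be consistent with the RAID1 unit not running in write-back mode. Comparing the fsync and fdatasync numbers gives a first hint of whether changing wal_sync_method is worth trying.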

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance