Hello PG perf junkies,

Sorry, this may get a little long-winded. Apologies if the formatting gets trashed, and apologies if this double posts. (I originally sent it yesterday with the wrong account and the message is stalled, so my bad there. If someone is a mod and it's still in the wait queue, feel free to remove it.)

Short version:
My zcav and dd tests look to be ->CPU bound<-. Yes, CPU bound, with junk numbers. My zcav numbers are flat like an SSD, which is odd for 15K rpm disks. I am not sure what the point of moving further would be given these unexpectedly poor numbers. I knew 12 disks weren't going to impress me (I am used to 24), but I was expecting about 40-50% better than what I am getting.

Background:
I have been setting up some new servers for PG and I am getting some odd numbers with zcav. I am hoping a second set of eyes here can point me in the right direction. (Other tests like bonnie++ (1.03e) and dd also give me odd, flat and low, numbers.)

I will preface this with: yes, I bought Greg's book. Yes, I read it, and it has helped me in the past, but I seem to have hit an oddity.

(Hardware, OS, and config details are listed at the end.)

Long version:
In the past when dealing with storage I have typically seen a large gain from moving from ext3 to XFS, provided I set readahead to 16384 on either filesystem. I also see the typical downward trend in MB/s (expected) and upward trend in access times (expected) with either filesystem.

These blades + storage blades are giving me atypical results. I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing access time really increase (something I have only seen before when I forgot to set readahead high enough). Things are just flat at about 420MB/s in zcav @ .6ms access time with XFS and ~470MB/s @ .56ms for ext3.

FWIW, I sometimes get worthless results from zcav and bonnie++ with either 1.03 or 1.96, which isn't something I have had happen before, even though Greg does mention it.

Also, when running zcav I see kswapdX (0 and 1 in my two-socket case) start to eat significant CPU time (~40-50% each). With dd, kswapd and pdflush become very active as well. This only happens once free memory gets low. On top of that, zcav or dd looks to be CPU bound at 100% while I/O wait stays at almost 0.0 most of the time (iostat -x -d shows %util at 98% though). I see this with either XFS or ext3. Also, when I cat /proc/zoneinfo it looks like I am getting heavy contention for a single page in DMA while the tests are running (see the end of this email for zoneinfo).

Bonnie reports 99% CPU usage. Watching it while it runs, it bounces between 100 and 99. kswapd goes nuts here as well.

I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher kernel to see some of the kswapd issues go away (hopefully testing that later this week). Maybe that will take care of everything. I don't know yet.

Side note: setting vm.swappiness to 10 (or 0) doesn't help, although others on the RHEL support site indicated it fixed kswapd issues for them.

On my home system (4-disk RAID 1+0, 3ware controller + BBWC, ext4, Ubuntu 2.6.38-8) zcav doesn't sit near 100% CPU, I see lots of I/O wait as expected, and my zoneinfo for DMA doesn't sit at 1.

I am not going to focus too much on ext3, since I am pretty sure I should be able to get better numbers from XFS.

With mkfs.xfs, I have done some reading and it appears that it can't automatically read the stripsize (aka stripe size to anyone other than HP) or the number of disks. So I have been using the following:

    mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256

(256K is the default HP stripsize for RAID 1+0. I have 12 disks in RAID 10, so I used sw=6. The agcount of 256 is a (random) number I got from Google that seemed in the ballpark.)

That gives me:

    meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
             =                       sectsz=512   attr=2
    data     =                       bsize=4096   blocks=215012774, imaxpct=25
             =                       sunit=64     swidth=384 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=32768, version=2
             =                       sectsz=512   sunit=64 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

If I don't specify the agcount or the su/sw values, I get:

    meta-data=/dev/cciss/c0d0        isize=256    agcount=4, agsize=53753194 blks
             =                       sectsz=512   attr=2
    data     =                       bsize=4096   blocks=215012774, imaxpct=25
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=32768, version=2
             =                       sectsz=512   sunit=0 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

So it seems like I should be giving it the extra parameters at mkfs.xfs time... could someone confirm? In the past I have never specified su, sw, or the AG count; I have taken the defaults. But since I am getting odd numbers here I started playing with them, and I am getting little or no change.

For mounting:

    logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m

(I know that noatime also implies nodiratime according to xfs.org, but in the past I seemed to get better numbers when having both.)

I am using nobarrier because I have a battery-backed RAID cache, and the FAQ at xfs.org seems to indicate that is the right choice.

FWIW, if I put sunit and swidth in the mount options, it seems to change them to lower values (when viewed with xfs_info), so I haven't been putting them in the mount options.

Verify readahead:

    blockdev --getra /dev/cciss/c0d0
    16384

If anyone wants the benchmark outputs I can send them, but basically zcav being FLAT for both MB/s and access time tells me something is wrong. It would take days for me to re-run all the ones I have done. I didn't save much once I saw results that didn't fit with what I thought I should get.
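
For reference, the units behind the numbers above can be worked out with a little shell arithmetic. This is only a sketch under stated assumptions: the inputs are the su=256k, sw=6, and 4k block size from the mkfs.xfs command above, and the variable names are made up for illustration. xfs_info and mkfs.xfs report sunit and swidth in filesystem blocks, while the sunit= and swidth= mount options are documented as taking 512-byte units, as is the value from blockdev --getra. Whether that unit difference explains the values appearing lower when set at mount time is a guess, not something confirmed on this box.

```shell
#!/bin/sh
# Inputs taken from the post; variable names are illustrative only.
strip_kb=256          # controller strip size (su=256k)
data_disks=6          # 12 disks in RAID 1+0 -> 6 data stripes (sw=6)
fs_block_bytes=4096   # -b size=4k
sector_bytes=512      # unit used by mount options and blockdev --getra

su_bytes=$(( strip_kb * 1024 ))

# mkfs.xfs / xfs_info show sunit and swidth in filesystem blocks.
sunit_blks=$(( su_bytes / fs_block_bytes ))
swidth_blks=$(( sunit_blks * data_disks ))

# The sunit= and swidth= mount options take 512-byte sectors instead.
sunit_sectors=$(( su_bytes / sector_bytes ))
swidth_sectors=$(( sunit_sectors * data_disks ))

# blockdev --getra reports readahead in 512-byte sectors as well.
readahead_sectors=16384
readahead_kb=$(( readahead_sectors * sector_bytes / 1024 ))

echo "xfs_info:      sunit=${sunit_blks} swidth=${swidth_blks} blks"
echo "mount options: sunit=${sunit_sectors},swidth=${swidth_sectors}"
echo "readahead:     ${readahead_kb} KiB"
```

With these inputs the block-unit values come out to sunit=64 and swidth=384, matching the mkfs.xfs output above. Passing those same block-unit numbers as mount options would describe a much smaller stripe, since the mount options count 512-byte sectors.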

I haven't done much with pgbench yet, as I figure it's pointless to move on while the raw I/O numbers look off to me. When I get there, I am going to make the call between WAL on the OS RAID 1 or going to 10 data disks, 2 OS, and 2 WAL.

I have gone up to 2.6.18-27(something, I want to say 2 or 4) to see if the issue went away. It didn't. I have gone back to 2.6.18-238.5 and put in a new CCISS driver directly from HP, and the issue does not go away with that either. People at work are thinking it might be a kernel bug that we have somehow never noticed before, which is why we are going to look at RHEL 6.1. We tried a 5.3 kernel that someone on the RH bugzilla said didn't have the issue, but this blade had a fit with it: no network, lots of other stuff not working, and then it kernel panicked, so we quickly gave up on that.

We may try to shoehorn in the 6.1 kernel and a few dependencies as well. Moving to RHEL 6.1 will mean a long test period before it can go into prod, and we want to get this new hardware in sooner than that can be done. (Even with all its problems, it's probably still faster than what it is replacing, just from the 48GB of RAM and the 3-generations-newer CPUs.)

Hardware and config details as they sit right now:

Blade Hardware:
ProLiant BL460c G7 (BIOS power flag set to high performance)
2 Intel 5660 CPUs
(HT left on)
48GB of RAM (12x4GB @ 1333MHz)
Smart Array P410i (Embedded)

Points of interest from hpacucli:
- Hardware Revision: Rev C
- Firmware Version: 3.66
- Cache Board Present: True
- Elevator Sort: Enabled
- Cache Status: OK
- Cache Backup Power Source: Capacitors
- Battery/Capacitor Count: 1
- Battery/Capacitor Status: OK
- Total Cache Size: 512 MB
- Accelerator Ratio: 25% Read / 75% Write
- Strip Size: 256 KB
- 2x 15K RPM 146GB 6Gbps SAS in RAID 1 for OS (ext3)
- Array Accelerator: Enabled
- Status: OK
- drives firmware = HPD5

Blade Storage subsystem:
HP SB2200 (12 disk 15K)

Points of interest from hpacucli:

Smart Array P410i in Slot 3
    Controller Status: OK
    Hardware Revision: Rev C
    Firmware Version: 3.66
    Elevator Sort: Enabled
    Wait for Cache Room: Disabled
    Cache Board Present: True
    Cache Status: OK
    Accelerator Ratio: 25% Read / 75% Write
    Drive Write Cache: Disabled
    Total Cache Size: 1024 MB
    No-Battery Write Cache: Disabled
    Cache Backup Power Source: Capacitors
    Battery/Capacitor Count: 1
    Battery/Capacitor Status: OK
    SATA NCQ Supported: True

Logical Drive: 1
    Size: 820.2 GB
    Fault Tolerance: RAID 1+0
    Heads: 255
    Sectors Per Track: 32
    Cylinders: 65535
    Strip Size: 256 KB
    Status: OK
    Array Accelerator: Enabled
    Disk Name: /dev/cciss/c0d0
    Mount Points: /raid 820.2 GB
    OS Status: LOCKED

12 drives in RAID 1+0, using XFS.

OS:
RHEL 5.6 (2.6.18-238.9.1.el5)

Database use:
PG 9.0.2 for OLTP.

CCISS info:
    filename:    /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
    version:     3.6.22-RH1
    description: Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
    author:      Hewlett-Packard Company

XFS INFO:
    xfsdump-2.2.48-3.el5
    xfsprogs-2.10.2-7.el5

XFS mkfs string:
    mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256

mkfs.xfs output:
    meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
             =                       sectsz=512   attr=2
    data     =                       bsize=4096   blocks=215012774, imaxpct=25
             =                       sunit=64     swidth=384 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=32768, version=2
             =                       sectsz=512   sunit=64 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

Head of ZONEINFO while zcav is running and kswapd is going nuts:
The min, low, and high of 1 seem odd to me. On other systems these get above 1.

    Node 0, zone      DMA
      pages free     2493
            min      1
            low      1
            high     1
            active   0
            inactive 0
            scanned  0 (a: 3 i: 3)
            spanned  4096
            present  2393
        nr_anon_pages 0
        nr_mapped    1
        nr_file_pages 0
        nr_slab      0
        nr_page_table_pages 0
        nr_dirty     0
        nr_writeback 0
        nr_unstable  0
        nr_bounce    0
        numa_hit     0
        numa_miss    0
        numa_foreign 0
        numa_interleave 0
        numa_local   0
        numa_other   0
            protection: (0, 3822, 24211, 24211)
      pagesets
      all_unreclaimable: 1
      prev_priority:     12
      start_pfn:         0

numastat (probably worthless since I had been pounding on this box for a while before capturing it):

                      node0        node1
    numa_hit          3126413031   247696913
    numa_miss         95489353     2781917287
    numa_foreign      2781917287   95489353
    interleave_hit    81178        97872
    local_node        3126297257   247706110
    other_node        95605127     2781908090

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance