benchmark woes and XFS options

"mark" <dvlhntr@xxxxxxxxx> · Mon, 8 Aug 2011 20:06:28 -0600

Hello PG perf junkies, 

Sorry, this may get a little long-winded. Apologies if the formatting gets
trashed, and also if this double posts. (I originally sent it yesterday from
the wrong account and that message is stalled, so my bad there.) If someone
is a mod and it's still in the wait queue, feel free to remove it.

Short version: 
My zcav and dd tests look to get ->CPU bound<-. Yes, CPU bound, with junk
numbers. My numbers in zcav are flat like an SSD, which is odd for 15K rpm
disks. I am not sure what the point of moving further would be given these
unexpectedly poor numbers. I knew 12 disks weren't going to impress me,
since I am used to 24, but I was expecting about 40-50% better than what I
am getting.

Background:

I have been setting up some new servers for PG and I am getting some odd
numbers with zcav. I am hoping a second set of eyes here can point me in the
right direction. (Other tests like bonnie++ (1.03e) and dd also give me odd
(flat and low) numbers.)

I will preface this with: yes, I bought Greg's book. Yes, I read it, and it
has helped me in the past, but I seem to have hit an oddity.

(Hardware, OS, and config details are listed at the end.)

Long version:

In the past when dealing with storage I typically see a large gain with
moving from ext3 to XFS, provided I set readahead to 16384 on either
filesystem.

I also see the typical downward trend in MB/s (expected) and upward trend
in access times (expected) with either filesystem.

These blades + storage blades are giving me atypical results.

I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing access
time really increase (something I have only seen before when I forgot to set
readahead high enough). Things are just flat at about 420MB/s @ 0.6ms access
time with XFS and ~470MB/s @ 0.56ms with ext3.

FWIW, I sometimes get worthless results from zcav and bonnie++ (using 1.03
or 1.96), which isn't something I have had happen before, even though Greg
does mention it.

Also, when running zcav I see kswapdX (0 and 1 in my two-socket case)
start to eat significant CPU time (~40-50% each); with dd, kswapd and
pdflush become very active as well. This only happens once free memory gets
low. In addition, zcav or dd looks to get CPU bound at 100% while I/O wait
stays at almost 0.0 most of the time (iostat -x -d shows %util at 98%,
though). I see this with either XFS or ext3. Also, when I cat /proc/zoneinfo
it looks like I am getting heavy contention for a single page in DMA while
the tests are running. (See the end of this email for zoneinfo.)

Bonnie++ reports 99% CPU usage. Watching it while it runs, it bounces
between 99 and 100. kswapd goes nuts here as well.

I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher kernel to
see some of the kswapd issues go away (hopefully testing that later this
week). Maybe that will take care of everything. I don't know yet.

Side note: setting vm.swappiness to 10 (or 0) doesn't help, although others
on the RHEL support site indicated it did fix kswapd issues for them.

Running zcav on my home system (4-disk RAID 1+0, 3ware controller + BBWC,
ext4, Ubuntu 2.6.38-8), I don't see zcav near 100% CPU, I see lots of I/O
wait as expected, and my zoneinfo for DMA doesn't sit at 1.

Not going to focus too much on ext3 since I am pretty sure I should be able
to get better numbers from XFS. 

With mkfs.xfs I have done some reading and it appears that it can't
automatically read the stripsize (aka stripe size to anyone other than HP)
or the number of disks. So I have been using the following:

mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256

(256K is the default hp stripsize for raid1+0, I have 12 disks in raid 10 so
I used sw=6, agcount of 256 because that is a (random) number I got from
google that seemed in the ball park.)

which gives me:
meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=215012774, imaxpct=25
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
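
As a sanity check on the output above, the reported sunit/swidth can be
recomputed from the mkfs.xfs options. This is only a small arithmetic sketch;
every number in it is copied from the command line and output in this post,
and the variable names are just for illustration.

```python
# Recompute the stripe geometry that mkfs.xfs printed from the options given.
BLOCK_SIZE = 4096          # -b size=4k
STRIPE_UNIT = 256 * 1024   # -d su=256k (the HP "strip size")
DATA_DISKS = 6             # -d sw=6 (12 disks in RAID 1+0)

sunit_blocks = STRIPE_UNIT // BLOCK_SIZE     # stripe unit in filesystem blocks
swidth_blocks = sunit_blocks * DATA_DISKS    # full stripe in filesystem blocks

AGCOUNT = 256
AGSIZE = 839936            # blocks per allocation group, from the output
TOTAL_BLOCKS = 215012774   # data blocks, from the output

# The first 255 AGs are full size; whatever remains goes into the last one.
last_ag_blocks = TOTAL_BLOCKS - (AGCOUNT - 1) * AGSIZE

print("sunit  =", sunit_blocks, "blks")
print("swidth =", swidth_blocks, "blks")
print("agsize is a multiple of sunit:", AGSIZE % sunit_blocks == 0)
print("last AG holds", last_ag_blocks, "blocks")
print("AG size: %.2f GiB" % (AGSIZE * BLOCK_SIZE / 2**30))
```

The sunit=64 and swidth=384 it prints match what mkfs.xfs reported, so the
su/sw options were at least interpreted the way I intended.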

(if I don't specify the agcount or su,sw options I get:
meta-data=/dev/cciss/c0d0        isize=256    agcount=4, agsize=53753194 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=215012774, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0)

So it seems like I should be giving it the extra parameters at mkfs.xfs
time... could someone confirm? In the past I have never specified su, sw,
or the AG count; I have just taken the defaults. But since I am getting odd
numbers here I started playing with them, with little or no change.

for mounting:
logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m

(I know that noatime also implies nodiratime according to xfs.org, but in
the past I seemed to get better numbers with both.)

I am using nobarrier because I have a battery backed raid cache and the FAQ
@ XFS.org seems to indicate that is the right choice. 

FWIW, if I put sunit and swidth in the mount options it seems to change them
lower (when viewed with xfs_info) so I haven't been putting it in the mount
options. 
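
One possible explanation for the "lower" values is a unit mismatch: the
sunit= and swidth= mount options are given in 512-byte units, while xfs_info
reports them in filesystem blocks. The sketch below converts between the two.
It assumes the numbers passed to mount were the same 64 and 384 that xfs_info
prints, which this post does not actually state; the helper names are made up
for illustration.

```python
SECTOR = 512        # unit used by the sunit=/swidth= mount options
BLOCK_SIZE = 4096   # filesystem block size (bsize in xfs_info)

def blocks_to_mount_units(blocks):
    """Convert an xfs_info value (filesystem blocks) to 512-byte mount units."""
    return blocks * BLOCK_SIZE // SECTOR

def mount_units_to_blocks(units):
    """Convert a mount-option value (512-byte units) to filesystem blocks."""
    return units * SECTOR // BLOCK_SIZE

# Mount-option values that correspond to su=256k,sw=6 on a 4k-block filesystem.
print("sunit=%d,swidth=%d" % (blocks_to_mount_units(64),
                              blocks_to_mount_units(384)))

# Assumed scenario: 64 and 384 passed to mount unconverted. This is what
# xfs_info would then report, in filesystem blocks.
print(mount_units_to_blocks(64), mount_units_to_blocks(384))
```

Under that assumption the mount options would need to be sunit=512,swidth=3072
to match the mkfs geometry, and passing 64/384 directly would show up as 8/48
blocks in xfs_info.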

verify readahead:
blockdev --getra /dev/cciss/c0d0
16384
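
For scale, blockdev --getra reports readahead in 512-byte sectors, so the
value above can be turned into bytes and compared with one full stripe. A
rough sketch, using the strip size and disk count from this post:

```python
SECTOR = 512                 # blockdev --getra counts 512-byte sectors
readahead_sectors = 16384    # value reported above
stripe_unit = 256 * 1024     # 256 KB strip size
data_disks = 6               # 12 disks in RAID 1+0

readahead_bytes = readahead_sectors * SECTOR
full_stripe_bytes = stripe_unit * data_disks

print("readahead:   %d MiB" % (readahead_bytes // 2**20))
print("full stripe: %.1f MiB" % (full_stripe_bytes / 2**20))
print("readahead covers %.2f full stripes" % (readahead_bytes / full_stripe_bytes))
```

So 16384 is an 8 MiB readahead, a bit more than five full stripes on this
array.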

If anyone wants the benchmark outputs I can send them, but basically zcav
being FLAT for both MB/s and access time tells me something is wrong. And
it will take days for me to rerun all the ones I have done; I didn't save
much once I saw results that didn't fit with what I thought I should get.

I haven't done much with pgbench yet, as I figure it's pointless to move on
while the raw I/O numbers look off to me. At that point I am going to make
the call between putting WAL on the OS RAID 1, or going to 10 data disks,
2 OS, and 2 WAL.

I have gone up to 2.6.18-27(something, I want to say 2 or 4) to see if the
issue went away; it didn't. I have gone back to 2.6.18-238.5 and put in a new
CCISS driver directly from HP, and the issue still does not go away. People
at work think it might be a kernel bug that we have somehow never noticed
before, which is why we are going to look at RHEL 6.1. We tried a 5.3 kernel
that someone on the RH bugzilla said didn't have the issue, but this blade
had a fit with it: no network, lots of other stuff not working, and then it
kernel panicked, so we quickly gave up on that...

We may try to shoehorn in the 6.1 kernel and a few dependencies as well.
Moving to RHEL 6.1 will mean a long test period before it can go into prod,
and we want to get this new hardware in sooner than that can be done. (Even
with all its problems, it's probably still faster than what it is replacing,
just from the 48GB of RAM and CPUs that are 3 generations newer.)

Hardware and config stuff as it sits right now.

Blade Hardware:
ProLiant BL460c G7 (bios power flag set to high performance)
2 Intel 5660 CPUs (HT left on)
48GB of RAM (12x4GB @ 1333MHz)
Smart Array P410i (Embedded)
Points of interest from hpacucli -
	- Hardware Revision: Rev C
	- Firmware Version: 3.66
	- Cache Board Present: True
	- Elevator Sort: Enabled
	- Cache Status: OK
	- Cache Backup Power Source: Capacitors
	- Battery/Capacitor Count: 1
	- Battery/Capacitor Status: OK
	- Total Cache Size: 512 MB
	- Accelerator Ratio: 25% Read / 75% Write
	- Strip Size: 256 KB
	- 2x 15K RPM 146GB 6Gbps SAS in raid 1 for OS (ext3)
	- Array Accelerator: Enabled
	- Status: OK
	- drives firmware = HPD5

Blade Storage subsystem:
HP SB2200 (12 disks, 15K RPM)

Points of interest from hpacucli 

Smart Array P410i in Slot 3
   Controller Status: OK
   Hardware Revision: Rev C
   Firmware Version: 3.66
   Elevator Sort: Enabled
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 25% Read / 75% Write
   Drive Write Cache: Disabled
   Total Cache Size: 1024 MB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Capacitors
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: True

      Logical Drive: 1
         Size: 820.2 GB
         Fault Tolerance: RAID 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Status: OK
         Array Accelerator: Enabled
         Disk Name: /dev/cciss/c0d0
         Mount Points: /raid 820.2 GB
         OS Status: LOCKED

12 drives in Raid 1+0, using XFS. 

OS:
RHEL 5.6 (2.6.18-238.9.1.el5)
Database use: PG 9.0.2 for OLTP. 

CCISS info:
filename:       /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
version:        3.6.22-RH1
description:    Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
author:         Hewlett-Packard Company

XFS INFO:
xfsdump-2.2.48-3.el5
xfsprogs-2.10.2-7.el5

XFS mkfs string:
mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256

mkfs.xfs output:
meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=215012774, imaxpct=25
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Head of /proc/zoneinfo while zcav is running and kswapd is going nuts.
The min, low, and high of 1 seem odd to me; on other systems these get
above 1.

Node 0, zone      DMA
  pages free     2493
        min      1
        low      1
        high     1
        active   0
        inactive 0
        scanned  0 (a: 3 i: 3)
        spanned  4096
        present  2393
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab      0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    numa_hit     0
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   0
        protection: (0, 3822, 24211, 24211)
  pagesets
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         0
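
To compare these watermarks across zones or across boxes without eyeballing
the whole file, something like the following could pull out just free, min,
low, and high per zone. It is a minimal sketch that assumes the zoneinfo
layout shown above (newer kernels lay the file out differently);
parse_watermarks and SAMPLE are names made up for this example, and SAMPLE is
just the first lines of the dump above.

```python
import re

ZONE_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
FREE_RE = re.compile(r"^\s*pages free\s+(\d+)")
# No colon after the name, so per-cpu pageset lines such as "high:" are skipped.
MARK_RE = re.compile(r"^\s+(min|low|high)\s+(\d+)\s*$")

def parse_watermarks(text):
    """Return {(node, zone): {'free': n, 'min': n, 'low': n, 'high': n}}."""
    zones = {}
    current = None
    for line in text.splitlines():
        m = ZONE_RE.match(line)
        if m:
            current = zones.setdefault((int(m.group(1)), m.group(2)), {})
            continue
        if current is None:
            continue
        m = FREE_RE.match(line)
        if m:
            current["free"] = int(m.group(1))
            continue
        m = MARK_RE.match(line)
        if m:
            current[m.group(1)] = int(m.group(2))
    return zones

SAMPLE = """\
Node 0, zone      DMA
  pages free     2493
        min      1
        low      1
        high     1
        active   0
"""

for (node, zone), marks in sorted(parse_watermarks(SAMPLE).items()):
    print("node %d zone %s: %s" % (node, zone, marks))
```

On a live box the same function can be fed the contents of /proc/zoneinfo
instead of SAMPLE.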

numastat (probably worthless since I have been pounding on this box for a
while before capturing it)

                           node0           node1
numa_hit              3126413031       247696913
numa_miss               95489353      2781917287
numa_foreign          2781917287        95489353
interleave_hit             81178           97872
local_node            3126297257       247706110
other_node              95605127      2781908090
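
To put the numastat counters in proportion, the share of pages that landed
on each node but were originally intended for the other node can be worked
out from numa_hit and numa_miss. A small sketch using only the numbers
above (and with the same caveat that the counters cover everything since
boot, not just one test run):

```python
# Counters copied from the numastat output above.
numastat = {
    "node0": {"numa_hit": 3126413031, "numa_miss": 95489353},
    "node1": {"numa_hit": 247696913, "numa_miss": 2781917287},
}

def miss_ratio(counters):
    """Fraction of pages allocated on a node that were meant for another node."""
    total = counters["numa_hit"] + counters["numa_miss"]
    return counters["numa_miss"] / total

for node in sorted(numastat):
    print("%s: %.1f%% of pages placed here were intended for another node"
          % (node, 100 * miss_ratio(numastat[node])))
```

That works out to roughly 3% for node0 and roughly 92% for node1, i.e. most
of what sits on node1 was spillover from allocations that wanted node0.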

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)