Re: XFS options and benchmark woes

"mark" <mark@xxxxxxxx> · Mon, 8 Aug 2011 00:18:24 -0600

> -----Original Message-----
> From: mark [mailto:mark@xxxxxxxx]
> Sent: Monday, August 08, 2011 12:15 AM
> To: 'pgsql-performance@xxxxxxxxxxxxxx'
> Subject: XFS options and benchmark woes
> 
> Hello PG perf junkies,
> 
> 
> Sorry this may get a little long winded. Apologies if the formatting
> gets trashed.
> 
> 
> 
> Background:
> 
> I have been setting up some new servers for PG and I am getting some
> odd numbers with zcav. I am hoping a second set of eyes here can point
> me in the right direction. (Other tests like bonnie++ (1.03e) and dd
> also give me odd (flat and low) numbers.)
> 
> I will preface this with: yes, I bought Greg's book. Yes, I read it, and
> it has helped me in the past, but I seem to have hit an oddity.
> 
> (hardware,os, and config stuff listed at the end)
> 
> 
> 
> 
> 
> Short version: my zcav and dd tests look to get I/O bound. My numbers
> in ZCAV are flat like an SSD, which is odd for 15K rpm disks.

Ugh, correction: the ZCAV numbers appear to be CPU bound, not I/O bound.

> 
> 
> 
> 
> Long version:
> 
> 
> In the past when dealing with storage I typically see a large gain with
> moving from ext3 to XFS, provided I set readahead to 16384 on either
> filesystem.
> 
> I also see the typical downward trend in MB/s (expected) and upward
> trend in access times (expected) with either file system.
> 
> 
> These blades + storage-blades are giving me atypical results.
> 
> 
> I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing
> access time really increase (something I have only seen before when I
> forgot to set readahead high enough). Things are just flat at about
> 420MB/s in zcav @ 0.6ms access time with XFS and ~470MB/s @ 0.56ms for
> ext3.
> 
> FWIW I get worthless results with zcav and bonnie++ using 1.03 or 1.96
> sometimes, which isn't something I have had happen before even though
> greg does mention it.
> 
> 
> Also, when running zcav I see kswapdX (0 and 1 in my two socket
> case) start to eat significant cpu time (~40-50% each). With dd,
> kswapd and pdflush become very active as well. This only happens once
> free mem gets low. zcav or dd also looks to get CPU bound at 100%
> while i/o wait stays at almost 0.0 most of the time (iostat -x -d
> shows %util at 98% though). I see this with either XFS or ext3. Also,
> when I cat /proc/zoneinfo it looks like I am getting heavy contention
> for a single page in DMA while the tests are running (see end of email
> for zoneinfo).
> 
> Bonnie++ is giving me 99% cpu usage reported. Watching it while running,
> it bounces between 100 and 99. kswapd goes nuts here as well.
> 
> 
> I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher
> kernel to see some of the kswapd issues go away (testing that
> hopefully later this week). Maybe that will take care of everything. I
> don't know yet.
> 
>  Side note: Setting vm.swappiness to 10 (or 0) doesn't help, although
> others on the RHEL support site indicated it did fix kswap issues for
> them.
> 
> 
> 
> Running zcav on my home system (4 disk raid 1+0, 3ware controller + BBWC,
> ext4, Ubuntu 2.6.38-8) I don't see zcav near 100% CPU, I see lots of
> i/o wait as expected, and my zoneinfo for DMA doesn't sit at 1.
> 
> Not going to focus too much on ext3 since I am pretty sure I should be
> able to get better numbers from XFS.
> 
> 
> 
> With mkfs.xfs I have done some reading and it appears that it can't
> automatically read the stripsize (aka stripe size to anyone other than
> HP) or the number of disks. So I have been using the following:
> 
> mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256
> 
> (256K is the default hp stripsize for raid1+0, I have 12 disks in raid
> 10 so I used sw=6, agcount of 256 because that is a (random) number I
> got from google that seemed in the ball park.)
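For what it's worth, a minimal sketch of the su/sw arithmetic, assuming a RAID 1+0 set stripes data across half of its disks and that xfs_info prints sunit/swidth in filesystem blocks. The function name is made up for illustration; it is not part of any XFS tool.

```python
def xfs_stripe_geometry(strip_kib, disks, block_bytes=4096):
    """Return (sw, sunit_blocks, swidth_blocks) for a RAID 1+0 array.

    Assumption: half of the disks carry data, the other half are mirrors.
    """
    if disks % 2:
        raise ValueError("RAID 1+0 needs an even number of disks")
    sw = disks // 2                               # data-bearing disks
    sunit = strip_kib * 1024 // block_bytes       # strip size in fs blocks
    return sw, sunit, sunit * sw


# 256 KB strip, 12 disks, 4k blocks
print(xfs_stripe_geometry(256, 12))  # (6, 64, 384)
```

With these inputs the result lines up with the sunit=64, swidth=384 blks that mkfs.xfs reports for su=256k,sw=6.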
> 
> 
> 
> 
> 
> 
> which gives me:
> meta-data=/dev/cciss/c0d0        isize=256    agcount=256, agsize=839936 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=215012774, imaxpct=25
>          =                       sunit=64     swidth=384 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=32768, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> (if I don't specify the agcount or su,sw stuff I get:
> meta-data=/dev/cciss/c0d0        isize=256    agcount=4, agsize=53753194 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=215012774, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=32768, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> )
> 
> 
> 
> 
> 
> So it seems like I should be giving it the extra parameters at mkfs.xfs
> time... could someone confirm? In the past I have never specified
> su, sw, or agcount; I have taken the defaults. But since I am getting
> odd numbers here I started playing with them, with little or no
> change.
> 
> 
> 
> for mounting:
> logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m
> 
> 
> 
> (I know that noatime also implies nodiratime according to xfs.org, but in
> the past I seemed to get better numbers when using both)
> 
> I am using nobarrier because I have a battery backed raid cache and the
> FAQ @ XFS.org seems to indicate that is the right choice.
> 
> 
> FWIW, if I put sunit and swidth in the mount options it seems to change
> them lower (when viewed with xfs_info) so I haven't been putting it in
> the mount options.
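One possible explanation, offered as a guess: the sunit/swidth mount options are given in 512-byte units, while xfs_info prints them in filesystem blocks. The sketch below assumes (the post does not say) that the xfs_info numbers 64 and 384 were reused as mount options; the helper names are made up for illustration.

```python
SECTOR_BYTES = 512  # unit used by the sunit=/swidth= mount options


def mount_stripe_options(su_kib, sw):
    """Values for the sunit=/swidth= mount options, in 512-byte units."""
    sunit = su_kib * 1024 // SECTOR_BYTES
    return sunit, sunit * sw


def as_fs_blocks(units_512, block_bytes=4096):
    """What a mount-option value looks like once shown in fs blocks."""
    return units_512 * SECTOR_BYTES // block_bytes


print(mount_stripe_options(256, 6))         # (512, 3072)
print(as_fs_blocks(64), as_fs_blocks(384))  # 8 48
```

Under that assumption, mounting with sunit=64,swidth=384 would show up as 8 and 48 blocks, which would look "lower"; sunit=512,swidth=3072 would correspond to the 256 KB strip across 6 data disks.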
> 
> 
> 
> 
> verify readahead:
> blockdev --getra /dev/cciss/c0d0
> 16384
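As a quick unit check: blockdev reports readahead in 512-byte sectors, so the value above can be converted as in this small sketch. The comparison against a full stripe assumes the 256 KB strip and 6 data disks described earlier.

```python
SECTOR_BYTES = 512  # blockdev --getra / --setra count 512-byte sectors


def readahead_kib(sectors):
    """Convert a blockdev readahead value to KiB."""
    return sectors * SECTOR_BYTES // 1024


ra_kib = readahead_kib(16384)
full_stripe_kib = 256 * 6  # strip size times data disks (assumed layout)

print(ra_kib)                    # readahead in KiB
print(ra_kib / full_stripe_kib)  # readahead expressed in full stripes
```

So 16384 sectors is 8 MiB of readahead, a bit more than five full stripes on this array.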
> 
> 
> 
> 
> 
> 
> 
> If anyone wants the benchmark outputs I can send them, but basically
> zcav being FLAT for both MB/s and access time tells me something is
> wrong. And it will take days for me to re-run all the ones I have done.
> I didn't save much once I saw results that didn't fit with what I
> thought I should get.
> 
> 
> 
> 
> I haven't done much with pgbench yet, as I figure it's pointless to move
> on while the raw I/O numbers look off to me. Once I get there I am going
> to make the call between WAL on the OS raid 1, or going to 10 data disks,
> 2 OS and 2 WAL.
> 
> 
> 
> 
> 
> 
> I have gone up to 2.6.18-27(something, wanna say 2 or 4) to see if the
> issue went away; it didn't. I have gone back to 2.6.18-238.5 and put in
> a new CCISS driver directly from HP, and the issue still does not go
> away. People at work think it might be a kernel bug that we have
> somehow never noticed before, which is why we are going to look at RHEL
> 6.1. We tried a 5.3 kernel that someone on rh bugzilla said didn't
> have the issue, but this blade had a fit with it: no network, lots of
> other stuff not working, and then it kernel panic'd, so we quickly gave
> up on that.
> 
> 
> 
> We may try and shoehorn in the 6.1 kernel and a few dependencies as
> well. Moving to RHEL 6.1 will mean a long test period before it can go
> into prod, and we want to get this new hardware in sooner than that can
> be done. (Even with all its problems it's probably still faster than
> what it is replacing, just from the 48GB of ram and 3 gen newer CPUs.)
> 
> 
> 
> 
> 
> 
> 
> 
> Hardware and config stuff as it sits right now.
> 
> 
> 
> Blade Hardware:
> ProLiant BL460c G7 (bios power flag set to high performance)
> 2 intel 5660 cpus. (HT left on)
> 48GB of ram (12x4GB @ 1333MHz)
> Smart Array P410i (Embedded)
> Points of interest from hpacucli -
> 	- Hardware Revision: Rev C
> 	- Firmware Version: 3.66
> 	- Cache Board Present: True
> 	- Elevator Sort: Enabled
> 	- Cache Status: OK
> 	- Cache Backup Power Source: Capacitors
>    	- Battery/Capacitor Count: 1
>    	- Battery/Capacitor Status: OK
> 	- Total Cache Size: 512 MB
>    	- Accelerator Ratio: 25% Read / 75% Write
> 	- Strip Size: 256 KB
> 	- 2x 15K RPM 146GB 6Gbps SAS in raid 1 for OS (ext3)
> 	- Array Accelerator: Enabled
> 	- Status: OK
> 	- drives firmware = HPD5
> 
> Blade Storage subsystem:
> HP SB2200 (12 disk 15K )
> 
> Points of interest from hpacucli
> 
> Smart Array P410i in Slot 3
>    Controller Status: OK
>    Hardware Revision: Rev C
>    Firmware Version: 3.66
>    Elevator Sort: Enabled
>    Wait for Cache Room: Disabled
>    Cache Board Present: True
>    Cache Status: OK
>    Accelerator Ratio: 25% Read / 75% Write
>    Drive Write Cache: Disabled
>    Total Cache Size: 1024 MB
>    No-Battery Write Cache: Disabled
>    Cache Backup Power Source: Capacitors
>    Battery/Capacitor Count: 1
>    Battery/Capacitor Status: OK
>    SATA NCQ Supported: True
> 
> 
>       Logical Drive: 1
>          Size: 820.2 GB
>          Fault Tolerance: RAID 1+0
>          Heads: 255
>          Sectors Per Track: 32
>          Cylinders: 65535
>          Strip Size: 256 KB
>          Status: OK
>          Array Accelerator: Enabled
>          Disk Name: /dev/cciss/c0d0
>          Mount Points: /raid 820.2 GB
>          OS Status: LOCKED
> 
> 12 drives in Raid 1+0, using XFS.
> 
> 
> OS:
> OS: RHEL 5.6 (2.6.18-238.9.1.el5)
> Database use: PG 9.0.2 for OLTP.
> 
> 
> CCISS info:
> filename:       /lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
> version:        3.6.22-RH1
> description:    Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
> author:         Hewlett-Packard Company
> 
> XFS INFO:
> xfsdump-2.2.48-3.el5
> xfsprogs-2.10.2-7.el5
> 
> head of /proc/zoneinfo while zcav is running and kswapd is going nuts:
> the min, low, high of 1 seems odd to me. On other systems these get above
> 1.
> 
> Node 0, zone      DMA
>   pages free     2493
>         min      1
>         low      1
>         high     1
>         active   0
>         inactive 0
>         scanned  0 (a: 3 i: 3)
>         spanned  4096
>         present  2393
>     nr_anon_pages 0
>     nr_mapped    1
>     nr_file_pages 0
>     nr_slab      0
>     nr_page_table_pages 0
>     nr_dirty     0
>     nr_writeback 0
>     nr_unstable  0
>     nr_bounce    0
>     numa_hit     0
>     numa_miss    0
>     numa_foreign 0
>     numa_interleave 0
>     numa_local   0
>     numa_other   0
>         protection: (0, 3822, 24211, 24211)
>   pagesets
>   all_unreclaimable: 1
>   prev_priority:     12
>   start_pfn:         0
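To compare these watermarks across boxes without eyeballing the whole file, something like the following rough sketch could pull them out. The parser is hypothetical and only handles the "pages free / min / low / high" lines in the layout shown above; the sample text is copied from this dump.

```python
import re


def zone_watermarks(text):
    """Map (node, zone) -> {'free', 'min', 'low', 'high'} from zoneinfo text."""
    zones, current = {}, None
    for line in text.splitlines():
        header = re.match(r"Node (\d+), zone\s+(\w+)", line)
        if header:
            key = (int(header.group(1)), header.group(2))
            current = zones.setdefault(key, {})
            continue
        field = re.match(r"\s+(?:pages )?(free|min|low|high)\s+(\d+)\s*$", line)
        if field and current is not None:
            current[field.group(1)] = int(field.group(2))
    return zones


sample = """Node 0, zone      DMA
  pages free     2493
        min      1
        low      1
        high     1
        active   0
"""

print(zone_watermarks(sample))
```

On a live system the input would be the contents of /proc/zoneinfo instead of the sample string.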
> 
> 
> 
> numastat (probably worthless since I have been pounding on this box for
> a while before capturing it)
> 
>                            node0           node1
> numa_hit              3126413031       247696913
> numa_miss               95489353      2781917287
> numa_foreign          2781917287        95489353
> interleave_hit             81178           97872
> local_node            3126297257       247706110
> other_node              95605127      2781908090
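Even with the caveat that these counters have been accumulating for a while, the imbalance can be put into numbers. A small sketch, using the counters as printed above: numa_miss counts pages that landed on a node although another node was preferred, so the ratio below is the share of each node's allocations that spilled over from elsewhere. The helper name is made up for illustration.

```python
def miss_share(numa_hit, numa_miss):
    """Fraction of a node's allocations that were intended for another node."""
    return numa_miss / (numa_hit + numa_miss)


# (numa_hit, numa_miss) per node, copied from the numastat output
counters = {
    "node0": (3126413031, 95489353),
    "node1": (247696913, 2781917287),
}

for node, (hit, miss) in counters.items():
    print(f"{node}: {miss_share(hit, miss):.1%} of allocations were misses")
```

That works out to roughly 3% on node0 and roughly 92% on node1, i.e. most of what node1 handed out was memory that node0 could not supply.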
> 
> 
> 

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance