Re: Slow Performance - Sequential IO

Christian Balzer <chibi@xxxxxxx> · Sat, 18 Jan 2020 13:13:48 +0900

Hello,

I had very odd results in the past with the fio rbd engine and would
suggest testing things in the environment you're going to deploy in, end
to end.

That said, without any caching or coalescing of writes, sequential 4k
writes will hit the same set of OSDs for each 4MB worth of data, so
throughput is limited by the overall latency (network, 3x write) of that
one set.
With random writes you will engage more or less all OSDs that hold your
fio file, thus spreading the load out.
This difference becomes more and more visible as the number of OSDs and
nodes increases.
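
To illustrate the point about 4MB worth of data, here is a rough Python
sketch (not anything Ceph ships). It assumes the RBD image is striped into
4 MiB objects, takes the 4k block size, 5g file size and iodepth of 32 from
your fio job file, and simply counts how many distinct objects one queue's
worth of in-flight I/Os lands on. All function names are made up for this
example, and it ignores the actual object-to-PG-to-OSD placement.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024   # assumed RBD object size (the 4MB mentioned above)
BLOCK_SIZE = 4 * 1024           # bs=4k from the fio job file
IMAGE_SIZE = 5 * 1024 ** 3      # size=5g from the fio job file
IODEPTH = 32                    # iodepth=32 from the fio job file


def object_index(offset):
    """Return the index of the object that a byte offset falls into."""
    return offset // OBJECT_SIZE


def objects_in_flight(offsets):
    """Return the distinct object indexes hit by one batch of in-flight I/Os."""
    return {object_index(off) for off in offsets}


def sequential_offsets(start_block, count):
    """Byte offsets of `count` consecutive 4k blocks starting at start_block."""
    return [(start_block + i) * BLOCK_SIZE for i in range(count)]


def random_offsets(count, rng):
    """Byte offsets of `count` 4k blocks picked uniformly across the image."""
    blocks = IMAGE_SIZE // BLOCK_SIZE
    return [rng.randrange(blocks) * BLOCK_SIZE for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    seq = objects_in_flight(sequential_offsets(0, IODEPTH))
    rnd = objects_in_flight(random_offsets(IODEPTH, rng))
    print("4k blocks per object:", OBJECT_SIZE // BLOCK_SIZE)
    print("objects hit by 32 sequential I/Os:", len(seq))
    print("objects hit by 32 random I/Os:", len(rnd))
```

Under these assumptions a full queue of sequential I/Os sits on a single
object, and so on one set of OSDs, for 1024 consecutive blocks, while the
random queue is spread over many objects at once.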

Regards,

Christian

On Fri, 17 Jan 2020 23:01:09 +0000 Anthony Brandelli (abrandel) wrote:

> Not been able to make any headway on this after some significant effort.
> 
> -Tested all 48 SSDs with FIO directly, all tested within 10% of each other for 4k iops in rand|seq read|write.
> -Disabled all CPU power save.
> -Tested with both rbd cache enabled and disabled on the client.
> -Tested with drive caches enabled and disabled (hdparm)
> -Minimal TCP retransmissions under load (<10 for a 2 minute duration).
> -No drops/pause frames noted on upstream switches.
> -CPU load on OSD nodes peaks at ~6.
> -iostat shows a peak of 15ms under read/write workloads, %util peaks at about 10%.
> -Swapped out the RBD client for a bigger box, since the load was peaking at 16. Now a 24 core box, load still peaks at 16.
> -Disabled cephx signatures
> -Verified hardware health (nothing in dmesg, nothing in CIMC fault logs, storage controller logs)
> -Tested multiple SSDs at once to find the controller's iops limit, which is apparently 650k @ 4k.
> 
> Nothing has made a noticeable difference here. I'm pretty baffled as to what would be causing the awful sequential read and write performance while still allowing good random r/w speeds.
> 
> I switched up fio testing methodologies to use more threads, but this didn't seem to help either:
> 
> [global]
> bs=4k
> ioengine=rbd
> iodepth=32
> size=5g
> runtime=120
> numjobs=4
> group_reporting=1
> pool=rbd_af1
> rbdname=image1
> 
> [seq-read]
> rw=read
> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> [seq-write]
> rw=write
> stonewall
> 
> [rand-write]
> rw=randwrite
> stonewall
> 
> Any pointers are appreciated at this point. I've been following other threads on the mailing list, and looked at the archives, related to RBD performance but none of the solutions that worked for others seem to have helped this setup.
> 
> Thanks,
> Anthony
> 
> ________________________________
> From: Anthony Brandelli (abrandel) <abrandel@xxxxxxxxx>
> Sent: Tuesday, January 14, 2020 12:43 AM
> To: ceph-users@xxxxxxxxxxxxxx <ceph-users@xxxxxxxxxxxxxx>
> Subject: Slow Performance - Sequential IO
> 
> 
> I have a newly set up test cluster that is giving some surprising numbers when running fio against an RBD. The end goal here is to see how viable a Ceph-based iSCSI SAN of sorts is for VMware clusters, which require a bunch of random IO.
> 
> 
> 
> Hardware:
> 2x E5-2630L v2 (2.4GHz, 6 core)
> 256GB RAM
> 2x 10gbps bonded network, Intel X520
> LSI 9271-8i, SSDs used for OSDs in JBOD mode
> Mons: 2x 1.2TB 10K SAS in RAID1
> OSDs: 12x Samsung MZ6ER800HAGL-00003 800GB SAS SSDs, super cap/power loss protection
> 
> 
> 
> Cluster setup:
> Three mon nodes, four OSD nodes
> Two OSDs per SSD
> Replica 3 pool
> Ceph 14.2.5
> 
> 
> 
> Ceph status:
>   cluster:
>     id:     e3d93b4a-520c-4d82-a135-97d0bda3e69d
>     health: HEALTH_WARN
>             application not enabled on 1 pool(s)
> 
>   services:
>     mon: 3 daemons, quorum mon1,mon2,mon3 (age 6d)
>     mgr: mon2(active, since 6d), standbys: mon3, mon1
>     osd: 96 osds: 96 up (since 3d), 96 in (since 3d)
> 
>   data:
>     pools:   1 pools, 3072 pgs
>     objects: 857.00k objects, 1.8 TiB
>     usage:   432 GiB used, 34 TiB / 35 TiB avail
>     pgs:     3072 active+clean
> 
> 
> 
> Network between nodes tests at 9.88gbps. Direct testing of the SSDs using a 4K block in fio shows 127k seq read, 86k random read, 107k seq write, 52k random write iops. No high CPU load/interface saturation is noted when running tests against the rbd.
> 
> 
> 
> When testing with a 4K block size against an RBD on a dedicated metal test host (same specs as other cluster nodes noted above) I get the following (command similar to fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw=XXXX -pool=scbench -runtime=60 -rbdname=datatest):
> 
> 
> 
> 10k sequential read iops
> 69k random read iops
> 13k sequential write iops
> 22k random write iops
> 
> 
> 
> I’m not clear why the random ops, especially read, would be so much quicker compared to the sequential ops.
> 
> 
> 
> Any pointers appreciated.
> 
> 
> 
> Thanks,
> 
> Anthony

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Mobile Inc.