Re: ceph cluster performance

Dinu Vlad <dinuvlad13@xxxxxxxxx> · Tue, 5 Nov 2013 13:15:25 +0200

Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings, I was able to get 440 MB/s from 8 rados bench instances against a single osd node (pool pg_num = 1800, size = 1).

This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!!

I'd appreciate any suggestions on where to look for the issue. Thanks!
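
As a rough sanity check on the gap, the figures already quoted in this thread can be put side by side. This is only a sketch: it assumes every client write passes through an SSD journal before reaching an HDD, so the combined journal bandwidth is treated as a ceiling. The constant names are made up for illustration; the numbers are the ones reported in the thread.

```python
# Figures quoted in this thread (MB/s); nothing here is newly measured.
RADOS_BENCH_MBPS = 440.0      # 8 rados bench instances, single osd node
FIO_AGGREGATE_MBPS = 2879.0   # fio sequential write, HDDs and SSDs together
SSD_WORST_MBPS = 153.0        # slowest journal SSD in the fio run
SSD_BEST_MBPS = 189.0         # fastest journal SSD in the fio run
NUM_JOURNAL_SSDS = 6


def fraction(observed: float, ceiling: float) -> float:
    """Return observed throughput as a fraction of a ceiling."""
    return observed / ceiling


# Assumption: every client write goes through an SSD journal first,
# so the journals' combined bandwidth caps client throughput.
journal_floor = NUM_JOURNAL_SSDS * SSD_WORST_MBPS
journal_ceiling = NUM_JOURNAL_SSDS * SSD_BEST_MBPS

print(f"vs. raw fio aggregate: {fraction(RADOS_BENCH_MBPS, FIO_AGGREGATE_MBPS):.1%}")
print(f"journal bandwidth range: {journal_floor:.0f}-{journal_ceiling:.0f} MB/s")
print(f"vs. journal floor: {fraction(RADOS_BENCH_MBPS, journal_floor):.1%}")
```

Under that assumption, 440 MB/s is roughly 15% of the raw fio aggregate and a bit under half of what the six journal SSDs sustained in the slowest case, so the raw 2.8 GB/s figure probably overstates what the node could deliver through ceph.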

On Oct 31, 2013, at 6:35 PM, Dinu Vlad <dinuvlad13@xxxxxxxxx> wrote:

> 
> I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) on fresh, repartitioned drives. I created a single pool with 1800 pgs and ran rados bench both on the osd server and on a remote one. The cluster configuration stayed "default", with the same additions for the xfs mount and mkfs.xfs options as before.
> 
> With a single host, the pgs were "stuck unclean" (active only, not active+clean):
> 
> # ceph -s
>  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
>   health HEALTH_WARN 1800 pgs stuck unclean
>   monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>   osdmap e101: 18 osds: 18 up, 18 in
>    pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
>   mdsmap e1: 0/0/1 up
> 
> 
> Test results: 
> Local test, 1 process, 16 threads: 241.7 MB/s
> Local test, 8 processes, 128 threads: 374.8 MB/s
> Remote test, 1 process, 16 threads: 231.8 MB/s
> Remote test, 8 processes, 128 threads: 366.1 MB/s
> 
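A short sketch of how the four results above compare, using only the numbers listed. The helper names (`speedup`, `remote_penalty`) are hypothetical and exist only for this illustration.

```python
# Throughput figures (MB/s) from the test results quoted above.
results = {
    ("local", 1): 241.7,
    ("local", 8): 374.8,
    ("remote", 1): 231.8,
    ("remote", 8): 366.1,
}


def speedup(where: str) -> float:
    """Throughput gain from going from 1 to 8 rados bench processes."""
    return results[(where, 8)] / results[(where, 1)]


def remote_penalty(processes: int) -> float:
    """Fraction of local throughput lost when the client is remote."""
    local = results[("local", processes)]
    return (local - results[("remote", processes)]) / local


for where in ("local", "remote"):
    print(f"{where}: 8 processes give {speedup(where):.2f}x the 1-process rate")
for n in (1, 8):
    print(f"{n} process(es): remote is {remote_penalty(n):.1%} slower than local")
```

Eight times the client processes yields only about 1.5x the throughput, and the remote client is within a few percent of the local one. That suggests, without proving it, that the limit sits on the osd node rather than in the network.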
> Maybe it's just me, but it seems on the low side too. 
> 
> Thanks,
> Dinu
> 
> 
> On Oct 30, 2013, at 8:59 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> 
>> On 10/30/2013 01:51 PM, Dinu Vlad wrote:
>>> Mark,
>>> 
>>> The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.
>>> 
>>> The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.
>>> 
>>> I did a fio test (raw partitions, 4M blocksize, ioqueue maxed out according to what the driver reports in dmesg). Here are the results (filtered):
>>> 
>>> Sequential:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec
>>> 
>>> Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s
>> 
>> Ok, that looks like what I'd expect to see given the controller being used.  SSDs are probably limited by total aggregate throughput.
>> 
>>> 
>>> Random:
>>> Run status group 0 (all jobs):
>>>  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec
>>> 
>>> Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101)
>>> 
>>> This is on just one of the osd servers.
>> 
>> Were the ceph tests to one OSD server or across all servers?  It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.
>> 
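The test suggested above could be scripted along these lines. This is a sketch only: it prints the command lines instead of running them, the pool name is borrowed from the thread, and the instance count, duration and concurrency are guesses rather than recommended values.

```python
import shlex


def no_replication_command(pool):
    """Command that sets the pool's replica count to 1 (no replication)."""
    return ["ceph", "osd", "pool", "set", pool, "size", "1"]


def rados_bench_commands(pool, instances, seconds=60, concurrent_ops=16):
    """One identical rados bench write command per instance, to be started together."""
    command = ["rados", "-p", pool, "bench", str(seconds), "write",
               "-t", str(concurrent_ops)]
    return [list(command) for _ in range(instances)]


# Hypothetical values: pool "bench2" comes from the thread, the rest are guesses.
print(shlex.join(no_replication_command("bench2")))
for command in rados_bench_commands("bench2", instances=8):
    # The trailing "&" backgrounds each instance so they run concurrently.
    print(shlex.join(command) + " &")
```

Summing the bandwidth reported by each instance then gives the aggregate figure for the node.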
>>> 
>>> Thanks,
>>> Dinu
>>> 
>>> 
>>> On Oct 30, 2013, at 6:38 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
>>> 
>>>> On 10/30/2013 09:05 AM, Dinu Vlad wrote:
>>>>> Hello,
>>>>> 
>>>>> I've been doing some tests on a newly installed ceph cluster:
>>>>> 
>>>>> # ceph osd pool create bench1 2048 2048
>>>>> # ceph osd pool create bench2 2048 2048
>>>>> # rbd -p bench1 create test
>>>>> # rbd -p bench1 bench-write test --io-pattern rand
>>>>> elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36
>>>>> 
>>>>> # rados -p bench2 bench 300 write --show-time
>>>>> # (run 1)
>>>>> Total writes made:      20665
>>>>> Write size:             4194304
>>>>> Bandwidth (MB/sec):     274.923
>>>>> 
>>>>> Stddev Bandwidth:       96.3316
>>>>> Max bandwidth (MB/sec): 748
>>>>> Min bandwidth (MB/sec): 0
>>>>> Average Latency:        0.23273
>>>>> Stddev Latency:         0.262043
>>>>> Max latency:            1.69475
>>>>> Min latency:            0.057293
>>>>> 
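The rados bench numbers above can be cross-checked against each other: with a fixed number of writes in flight, bandwidth should be close to concurrency times write size divided by average latency. The sketch below assumes the run used the rados bench default of 16 concurrent writes, since the command line shown does not pass -t; the function name is hypothetical.

```python
# Values copied from the rados bench output above.
WRITE_SIZE_BYTES = 4194304
AVG_LATENCY_S = 0.23273
TOTAL_WRITES = 20665
REPORTED_MBPS = 274.923

# Assumption: default of 16 concurrent writes (no -t on the command line).
CONCURRENT_OPS = 16

MB = 1024 * 1024


def predicted_bandwidth_mbps(concurrency, latency_s, write_size_bytes):
    """Bandwidth implied by keeping `concurrency` writes in flight."""
    ops_per_second = concurrency / latency_s
    return ops_per_second * write_size_bytes / MB


predicted = predicted_bandwidth_mbps(CONCURRENT_OPS, AVG_LATENCY_S, WRITE_SIZE_BYTES)
# Run length implied by the total data written and the reported bandwidth.
elapsed_s = TOTAL_WRITES * WRITE_SIZE_BYTES / MB / REPORTED_MBPS

print(f"predicted: {predicted:.1f} MB/s, reported: {REPORTED_MBPS} MB/s")
print(f"implied run length: {elapsed_s:.1f} s")
```

The prediction lands within a fraction of a percent of the reported 274.923 MB/s, and the implied run length is just over the requested 300 seconds. If the concurrency assumption holds, a single instance at this queue depth is bound by the ~0.23 s average latency per 4 MB write, so higher throughput would need either more writes in flight or lower per-write latency.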
>>>>> These results seem to be quite poor for the configuration:
>>>>> 
>>>>> MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
>>>>> OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to an LSI 9207-8i controller.
>>>>> All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has three 10 GB partitions for journals.
>>>> 
>>>> Agreed, you should see much higher throughput with that kind of storage setup.  What brand/model SSDs are these?  Also, what brand and model of chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.
>>>> 
>>>> I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes.  Typically I've tested fio on top of a filesystem on RBD.
>>>> 
>>>>> 
>>>>> Using Ubuntu 13.04, ceph 0.67.4, XFS for backend storage. The cluster was installed using ceph-deploy. ceph.conf is pretty much out of the box (diff from default follows):
>>>>> 
>>>>> osd_journal_size = 10240
>>>>> osd mount options xfs = "rw,noatime,nobarrier,inode64"
>>>>> osd mkfs options xfs = "-f -i size=2048"
>>>>> 
>>>>> [osd]
>>>>> public network = 10.4.0.0/24
>>>>> cluster network = 10.254.254.0/24
>>>>> 
>>>>> All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics.
>>>>> 
>>>>> I've done a few other tests of the individual components:
>>>>> - network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
>>>>> - md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
>>>>> - fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS
>>>> 
>>>> What you might want to try is doing 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently, and seeing what both the per-drive and aggregate throughput look like.
>>>> 
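One way to set up the fio run described above is to generate a job file with one section per drive. This is a minimal sketch: the device names, the iodepth of 32 and the 60 second runtime are placeholders, not values from the thread. Writing to raw devices destroys whatever is on them.

```python
def fio_job(devices, block_size="4M", iodepth=32, runtime_s=60):
    """Build a fio job file: one concurrent sequential-write job per device."""
    lines = [
        "[global]",
        "ioengine=libaio",    # asynchronous IO, as suggested above
        "direct=1",           # bypass the page cache
        "rw=write",           # sequential writes
        f"bs={block_size}",
        f"iodepth={iodepth}",
        f"runtime={runtime_s}",
        "time_based",
        "",
    ]
    for dev in devices:
        # Name each job after its device so per-drive results are easy to read.
        name = dev.rsplit("/", 1)[-1]
        lines += [f"[{name}]", f"filename={dev}", ""]
    return "\n".join(lines)


# Hypothetical device names; WARNING: running this job overwrites them.
devices = [f"/dev/sd{letter}" for letter in "bcd"]
print(fio_job(devices))
```

Saving the output to a file and passing it to fio runs all jobs at once. fio then reports each job separately, which gives the per-drive numbers, and the "Run status group" summary at the end gives the aggregate.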
>>>> With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is set up on your system.
>>>> 
>>>>> 
>>>>> I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.
>>>>> 
>>>>> Thanks
>>>>> Dinu
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> 