Re: All SSD cluster performance

> On 13 January 2017 at 20:33, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
> 
> 
> 
> > On Jan 13, 2017, at 1:34 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > 
> >> 
> >> On 13 January 2017 at 18:50, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
> >> 
> >> 
> >> 
> >>> On Jan 13, 2017, at 12:41 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >>> 
> >>> 
> >>>> On 13 January 2017 at 18:39, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
> >>>> 
> >>>> 
> >>>> 
> >>>>> On Jan 13, 2017, at 12:37 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >>>>> 
> >>>>> 
> >>>>>> On 13 January 2017 at 18:18, Mohammed Naser <mnaser@xxxxxxxxxxxx> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> Hi everyone,
> >>>>>> 
> >>>>>> We have a deployment with 90 OSDs at the moment, all SSD, that isn't quite hitting the performance it should in my opinion; a `rados bench` run gives numbers along these lines:
> >>>>>> 
> >>>>>> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> >>>>>> Object prefix: benchmark_data_bench.vexxhost._30340
> >>>>>> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >>>>>>  0       0         0         0         0         0           -           0
> >>>>>>  1      16       158       142   568.513       568   0.0965336   0.0939971
> >>>>>>  2      16       287       271   542.191       516   0.0291494    0.107503
> >>>>>>  3      16       375       359    478.75       352   0.0892724    0.118463
> >>>>>>  4      16       477       461   461.042       408   0.0243493    0.126649
> >>>>>>  5      16       540       524   419.216       252    0.239123    0.132195
> >>>>>>  6      16       644       628    418.67       416    0.347606    0.146832
> >>>>>>  7      16       734       718   410.281       360   0.0534447    0.147413
> >>>>>>  8      16       811       795   397.487       308   0.0311927     0.15004
> >>>>>>  9      16       879       863   383.537       272   0.0894534    0.158513
> >>>>>> 10      16       980       964   385.578       404   0.0969865    0.162121
> >>>>>> 11       3       981       978   355.613        56    0.798949    0.171779
> >>>>>> Total time run:         11.063482
> >>>>>> Total writes made:      981
> >>>>>> Write size:             4194304
> >>>>>> Object size:            4194304
> >>>>>> Bandwidth (MB/sec):     354.68
> >>>>>> Stddev Bandwidth:       137.608
> >>>>>> Max bandwidth (MB/sec): 568
> >>>>>> Min bandwidth (MB/sec): 56
> >>>>>> Average IOPS:           88
> >>>>>> Stddev IOPS:            34
> >>>>>> Max IOPS:               142
> >>>>>> Min IOPS:               14
> >>>>>> Average Latency(s):     0.175273
> >>>>>> Stddev Latency(s):      0.294736
> >>>>>> Max latency(s):         1.97781
> >>>>>> Min latency(s):         0.0205769
> >>>>>> Cleaning up (deleting benchmark objects)
> >>>>>> Clean up completed and total clean up time :3.895293
> >>>>>> 
> >>>>>> We’ve verified the network by running `iperf` across both the replication and public networks, which resulted in 9.8Gb/s (10G links for both).  The machine that’s running the benchmark doesn’t even saturate its port.  The SSDs are S3520 960GB drives which we’ve benchmarked (fio etc.) and they can handle the load.  At this point I’m not really sure where to look next... anyone running all-SSD clusters who might be able to share their experience?
> >>>>> 
> >>>>> I suggest that you search a bit on the ceph-users list since this topic has been discussed multiple times in the past and even recently.
> >>>>> 
> >>>>> Ceph isn't your average storage system and you have to keep that in mind. Nothing is free in this world. Ceph provides excellent consistency and distribution of data, but that also means that you have things like network and CPU latency.
> >>>>> 
> >>>>> However, I suggest you look up a few threads on this list which have valuable tips.
> >>>>> 
> >>>>> Wido
> >>>> 
> >>>> Thanks for the reply. I’ve actually done quite a lot of research and gone through many of the previous posts. While I agree 100% with your statement, I’ve found that other people with similar setups have been able to reach numbers that I cannot, which leads me to believe that there is actually an issue here.  They have been able to max out at 1200 MB/s, which is the maximum of their benchmarking host.  We’d like to reach that, and I think that given the specifications of the cluster, it can do so with no problems.
> >>> 
> >>> A few tips:
> >>> 
> >>> - Disable all logging in Ceph (debug_osd, debug_ms, debug_auth, etc, etc)
> >> 
> >> All logging is configured to default settings, should those be turned down?
> > 
> > Yes, disabling all logging improves performance.
> 
> I’ll look into disabling it.

Good, you can do it on the fly:

$ ceph tell osd.* injectargs '--debug_osd=0/0'
$ ceph tell osd.* injectargs '--debug_ms=0/0'
$ ceph tell osd.* injectargs '--debug_filestore=0/0'

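If you want those to survive OSD restarts, the same values can go into ceph.conf on the OSD hosts; a minimal sketch using the standard config syntax (nothing here is specific to your setup):

[osd]
  debug_osd = 0/0
  debug_ms = 0/0
  debug_auth = 0/0
  debug_filestore = 0/0
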
> 
> > 
> >> 
> >>> - Disable power saving on the CPUs
> >> 
> >> All disabled as well, everything running on `performance` mode.
> >> 
> >>> 
> >>> Can you also share how the 90 OSDs are distributed in the cluster and what CPUs you have?
> >> 
> >> There are 45 machines with 2 OSDs each.  The servers they’re located on have, on average, 24-core ~3GHz Intel CPUs.  Both OSDs are pinned to two cores on the system.
> >> 
> > 
> > So 45 machines in total with 2 OSDs/SSDs each.
> > 
> > What is the network? 10GbE? What is the latency for an 8k packet? (ping -s 8192)
> 
> It is a 10GbE network; the latency is on average 0.217 ms.

Ok, that's good.

> 
> > 
> > Also try running rados bench with more threads, 16 isn't that much. Try running with 128 or so from multiple clients.
> 
> With 128 threads, I’m able to get an average of 900 MB/s.  Every drive seems to average out to ~20MB/s at that peak.  Running it multiple times seems to introduce very odd issues with extra data… are multiple concurrent rados bench runs not supported?
> 

Isn't the 900MB/sec almost saturating the client? That is roughly 7.2Gbit/s, so a single benchmark host on a 10GbE link has little headroom left above it.

You can run 'rados bench' from multiple machines at the same time.
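If you do, give each concurrent run its own --run-name so the runs don't share the default metadata object; two writers using the same run name is a likely explanation for the "extra data" oddities you saw. Roughly like this, assuming a recent enough rados and with the pool name as a placeholder:

$ rados bench -p <pool> 60 write -t 128 --run-name client1 --no-cleanup   # on client 1
$ rados bench -p <pool> 60 write -t 128 --run-name client2 --no-cleanup   # on client 2
$ rados -p <pool> cleanup --run-name client1                              # afterwards, per run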

That got me thinking: with 90 OSDs and 3x replication you should have around 3000 PGs for that pool. How many PGs does it currently have?
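(The rule of thumb behind that number: roughly 100 PGs per OSD, so 90 x 100 / 3 replicas = 3000, usually rounded to a power of two such as 2048 or 4096.) You can check the current values with something like this, pool name again being a placeholder:

$ ceph osd pool get <pool> pg_num
$ ceph osd pool get <pool> pgp_num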

Wido

> > 
> > Wido
> > 
> >>> 
> >>> Wido
> >>> 
> >>>> 
> >>>>>> 
> >>>>>> Thanks,
> >>>>>> Mohammed
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



