Re: Ceph, SSD, and NVMe

Somnath Roy <Somnath.Roy@xxxxxxxxxxx> · Wed, 30 Sep 2015 19:33:06 +0000

David,
You should move to Hammer to get the full performance benefits. The improvements were added in Giant and carried into the current Hammer LTS release.
FYI, the focus so far has been on read performance. In our environment with 6Gb SAS SSDs, we are able to saturate the drives bandwidth-wise at 64K block sizes and above. But with smaller blocks like 4K, we are not able to saturate the SAS SSDs yet.
But considering Ceph's scale-out nature, you can get some very good numbers out of a cluster. For example, with 8 SAS SSDs (in a JBOF) and 2 heads in front (so, a 2-node Ceph cluster), we are able to hit ~300K random read IOPS, while the aggregate performance of the 8 SSDs would be ~400K. Not too bad. At this point we are saturating the host CPUs.
We have seen almost linear scaling as you add similar setups, i.e. with ~3 of the above setups you could hit ~900K random read IOPS. So I would say it is definitely there in terms of read IOPS, and more improvements are coming.
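As a rough sanity check on the arithmetic above, here is a minimal Python sketch. The two constants are the approximate figures quoted in this message; the function names are made up for illustration, and the projection assumes perfectly linear scaling, which is an idealization of the "almost linear" behavior described.

```python
# Approximate figures quoted in the message above.
RAW_SSD_IOPS = 400_000         # aggregate random-read IOPS of the 8 SAS SSDs
MEASURED_SETUP_IOPS = 300_000  # random-read IOPS seen through the 2-node cluster


def drive_utilization(measured_iops: int, raw_iops: int) -> float:
    """Fraction of the raw drive capability delivered through Ceph."""
    return measured_iops / raw_iops


def projected_iops(setups: int, per_setup_iops: int = MEASURED_SETUP_IOPS) -> int:
    """Projected random-read IOPS, ASSUMING perfectly linear scaling."""
    return setups * per_setup_iops


if __name__ == "__main__":
    util = drive_utilization(MEASURED_SETUP_IOPS, RAW_SSD_IOPS)
    print(f"drive utilization per setup: {util:.0%}")
    print(f"3 setups, linear projection: ~{projected_iops(3):,} IOPS")
```

Real clusters will land somewhat below the linear projection, since the hosts are already CPU-bound at this point.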
But the write path is awful compared to read, and that's where the problem is, because no mainstream workload is 100% random read (IMO). So even with, say, a 90/10 read/write mix, the performance numbers would be ~6-7X slower.
So it depends very much on your workload/application access pattern and, obviously, on how much you are willing to spend.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Wednesday, September 30, 2015 12:04 PM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Ceph, SSD, and NVMe

On 09/30/2015 09:34 AM, J David wrote:
> Because we have a good thing going, our Ceph clusters are still
> running Firefly on all of our clusters including our largest, all-SSD
> cluster.
>
> If I understand right, newer versions of Ceph make much better use of
> SSDs and give overall much higher performance on the same equipment.
> However, the impression I get of newer versions is that they are also
> not as stable as Firefly and should only be used with caution.
>
> Given our storage consumers have an effectively unlimited appetite for
> IOPs and throughput, more performance would be very welcome.  But not
> if it leads to cluster crashes and lost data.
>
> What really prompts this is that we are starting to see large-scale
> NVMe equipment appearing in the channel ( e.g.
> http://www.supermicro.com/products/system/1U/1028/SYS-1028U-TN10RT_.cfm
> ).  The cost is significantly higher with commensurately higher
> theoretical performance.  But if we're already not pushing our SSDs to
> the max over SAS, the added benefit of NVMe would largely be lost.
>
> On the other hand, if we could safely upgrade to a more recent version
> that is as stable and bulletproof as Firefly has been for us, but has
> better performance with SSDs, that would not only benefit our current
> setup, it would be a necessary first step for moving onto NVMe.
>
> So this raises three questions:
>
> 1) Have I correctly understood that one or more post-Firefly releases
> exist that (c.p.) perform significantly better with all-SSD setups?
>
> 2) Is there any such release that (generally) is as rock-solid as
> Firefly?  Of course this is somewhat situationally dependent, so I
> would settle for: is there any such release that doesn't have any
> known minding-my-own-business-suddenly-lost-data bugs in a 100% RBD
> use case?
>
> 3) Has anyone done anything with NVMe as storage (not just journals)
> who would care to share what kind of performance they experienced?
>
> (Of course if we do upgrade we will do so carefully, do a test cluster
> first, have backups standing by, etc.  But if it's already known that
> doing so will either not improve anything or is likely to blow up in
> our faces, it would be better to leave well enough alone.  The current
> performance is by no means bad, we're just always greedy for more. :)
> )
>
> Thanks for any advice/suggestions!

Hi David,

The single biggest performance improvement we've seen for SSDs came out of the memory allocator investigation that Chaitanya Hulgol and Somnath Roy spearheaded at SanDisk, which others, including myself, have since followed up on and tried to expand.

See:

http://www.spinics.net/lists/ceph-devel/msg25823.html
https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg23100.html
http://www.spinics.net/lists/ceph-devel/msg21582.html

I haven't tested Firefly, but there's a good chance that you may see a significant performance improvement simply by upgrading your systems to tcmalloc 2.4 and running the OSDs with 128MB of thread cache, or by using LD_PRELOAD to load jemalloc.  This isn't something we officially support in RHCS yet, but we'll likely be moving toward it for future releases based on the very positive results we are seeing.  The biggest thing to keep in mind is that this does increase per-OSD memory usage by several hundred MB, so the 3-4X IOPS increase does come with a cost.  On the plus side, it also reduces CPU usage, sometimes dramatically.  You may be able to offset the increased memory usage somewhat by disabling transparent huge pages (especially with jemalloc).
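To make the allocator settings above concrete, here is a minimal Python sketch of the environment an OSD process might be started with. It is an illustration under stated assumptions, not a supported procedure: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is the gperftools environment variable for the total thread cache size, the jemalloc library path is whatever your distribution installs (the caller must supply it), and the helper names are hypothetical. The last helper only parses the text format of /sys/kernel/mm/transparent_hugepage/enabled, where the active mode appears in brackets.

```python
import os
from typing import Mapping, Optional


def tcmalloc_env(cache_mb: int = 128,
                 base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy of the environment with the tcmalloc thread cache size set."""
    env = dict(os.environ if base is None else base)
    # gperftools expects the value in bytes; 128 MB -> 134217728.
    env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(cache_mb * 1024 * 1024)
    return env


def jemalloc_env(lib_path: str,
                 base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy of the environment with jemalloc placed first in LD_PRELOAD.

    lib_path is an assumption supplied by the caller; its location
    differs between distributions.
    """
    env = dict(os.environ if base is None else base)
    preload = [lib_path] + env.get("LD_PRELOAD", "").split()
    env["LD_PRELOAD"] = " ".join(preload)
    return env


def thp_mode(sysfs_text: str) -> str:
    """Return the active THP mode from text like 'always madvise [never]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed mode found in: %r" % sysfs_text)
```

Either dictionary could be passed as the env argument of subprocess.Popen when launching an OSD by hand on a test node. How your init system actually starts ceph-osd varies by distribution and release, so verify where environment variables need to be set before relying on this.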

See:

http://www.spinics.net/lists/ceph-devel/msg26483.html

FWIW, between Sage's newstore work and recent work by Somnath Roy to optimize the write path, we may see further improvement, but neither of those is ready for production yet.

> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>