RE: Ceph, SSD, and NVMe

Межов Игорь Александрович <megov@xxxxxxxxxx> · Fri, 2 Oct 2015 08:18:23 +0000

Hi!

Yes, we run a small Hammer cluster in production.
Initially it was a 6-node Firefly installation on slightly outdated hardware:
 - Intel 56XX platforms,
 - 32-48 GB RAM,
 - 70 SATA OSDs (1 TB/2 TB),
 - SSD journals on DC S3700 200 GB,
 - 10 Gbit interconnect,
 - ~100 VM images (RBD only)

To run Hammer we decided not to upgrade the existing cluster, but to build
a new one from scratch. We initially started Hammer on 2 nodes (E5-2670,
128 GB RAM, 20 SAS OSDs + 2x200 SSD each), then moved images one by one
to the new cluster, shrinking the old one and re-adding the freed hosts to
Hammer as a separate SATA pool. We're in the middle of this process now, and
when finished we will have 10 nodes (4 SAS, 6 SATA).

>So, do medium-sized IT organizations (i.e. those without the resources
>to have a Ceph developer on staff) run Hammer-based deployments in
>production successfully?

So our answer to this question is 'yes'.  ;)

Megov Igor
CIO, Yuterra

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of J David <j.david.lists@xxxxxxxxx>
Sent: 2 October 2015, 5:01
To: Somnath Roy
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Ceph, SSD, and NVMe

This is all very helpful feedback, thanks so much.

Also it sounds like you guys have done a lot of work on this, so
thanks for that as well!

Is Hammer generally considered stable enough for production in an
RBD-only environment?  The perception around here is that the number
of people who report lost data or inoperable clusters due to bugs in
Hammer on this list is troubling enough to cause hesitation.  There's
a specific term for overweighting the probability of catastrophic
negative outcomes, and maybe that's what's happening.  People tend not
to post to the list "Hey we have a cluster, it's running great!"
instead waiting until things are not great, so the list paints an
artificially depressing picture of stability.  But when we ask around
quietly to other places we know running Ceph in production, which is
admittedly a very small sample, they're all also still running
Firefly.

Admittedly, it doesn't help that "On my recommendation, we performed a
non-reversible upgrade on the production cluster which, despite our
best testing efforts, wrecked things causing us to lose 4 hours of
data and requiring 2 days of downtime while we rebuilt the cluster and
restored the backups" is pretty much guaranteed to be followed by,
"You're fired."

So, do medium-sized IT organizations (i.e. those without the resources
to have a Ceph developer on staff) run Hammer-based deployments in
production successfully?

Please understand this is not meant to be sarcastic or critical of the
project in any way.  Ceph is amazing, and we love it.  Some features
of Ceph, like CephFS, have been considered not-production-quality for
a long time, and that is to be expected.  These things are incredibly
complex and take time to get right.  So organizations in our position
just don't use that stuff.  As a relative outsider for whom the Ceph
source code is effectively a foreign language, it's just *really* hard
to tell if Hammer in general is in that same "still baking" category.

Thanks!

On Wed, Sep 30, 2015 at 3:33 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> David,
> You should move to Hammer to get all the performance benefits. They were all added in Giant and carried forward into the present Hammer LTS release.
> FYI, the focus so far has been on read performance, and in our environment with 6Gb SAS SSDs we are able to saturate the drives bandwidth-wise from 64K blocks onwards. But with smaller blocks like 4K we are not able to saturate the SAS SSDs yet.
> But, considering Ceph's scale-out nature, you can get some very good numbers out of a cluster. For example, with 8 SAS SSDs (in a JBOF) and 2 heads in front (so, a 2-node Ceph cluster) we are able to hit ~300K random read IOPS, while the aggregated performance of the 8 SSDs would be ~400K. Not too bad. At this point we are saturating the host CPUs.
> We have seen almost linear scaling when adding similar setups, i.e. with, say, ~3 of the above setups you could hit ~900K RR IOPS. So I would say it is definitely there in terms of read IOPS, and more improvements are coming.
> But the write path is very poor compared to read, and that's where the problem is, because in the mainstream no workload is 100% RR (IMO). So even if you have, say, a 90-10 read/write mix, the performance numbers would be ~6-7X slower.
> So it depends very much on your workload/application access pattern and, obviously, on the cost you are willing to pay.
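
The figures quoted above can be checked with some back-of-the-envelope arithmetic. This is a minimal sketch, not a benchmark: the IOPS numbers come from the message above, while the function names and the weighted-cost model for a mixed workload are assumptions made here for illustration. Under that simple model, a 90/10 read/write mix being 6-7X slower than pure reads would imply that each write costs roughly 51 to 61 times as much as a read.

```python
def read_efficiency(cluster_iops, raw_drive_iops):
    """Fraction of the aggregate raw drive IOPS that the cluster delivers."""
    return cluster_iops / raw_drive_iops


def scaled_iops(per_setup_iops, setups):
    """Projected IOPS assuming perfectly linear scaling (an idealization)."""
    return per_setup_iops * setups


def mixed_slowdown(read_fraction, write_cost_ratio):
    """Slowdown versus a pure-read workload.

    Assumed model: the average cost per operation is the weighted sum of
    the read cost (1.0) and the write cost (write_cost_ratio reads).
    """
    return read_fraction + (1.0 - read_fraction) * write_cost_ratio


def implied_write_cost(read_fraction, slowdown):
    """Invert mixed_slowdown: the write cost ratio that yields `slowdown`."""
    return (slowdown - read_fraction) / (1.0 - read_fraction)


if __name__ == "__main__":
    # 2-node cluster in front of 8 SAS SSDs: ~300K of ~400K raw IOPS.
    print("read efficiency:", read_efficiency(300_000, 400_000))
    # Three such setups, if scaling stays linear.
    print("3 setups:", scaled_iops(300_000, 3))
    # Write cost implied by a 6X and a 7X slowdown at a 90/10 mix.
    for slowdown in (6.0, 7.0):
        cost = implied_write_cost(0.9, slowdown)
        print(f"{slowdown}X slower -> write costs ~{cost:.0f} reads")
```

The point of the model is only that a small write fraction dominates the result once writes are much more expensive than reads; real clusters will not follow it exactly.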
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Wednesday, September 30, 2015 12:04 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Ceph, SSD, and NVMe
>
> On 09/30/2015 09:34 AM, J David wrote:
>> Because we have a good thing going, we are still running Firefly on
>> all of our Ceph clusters, including our largest, all-SSD
>> cluster.
>>
>> If I understand right, newer versions of Ceph make much better use of
>> SSDs and give overall much higher performance on the same equipment.
>> However, the impression I get of newer versions is that they are also
>> not as stable as Firefly and should only be used with caution.
>>
>> Given our storage consumers have an effectively unlimited appetite for
>> IOPs and throughput, more performance would be very welcome.  But not
>> if it leads to cluster crashes and lost data.
>>
>> What really prompts this is that we are starting to see large-scale
>> NVMe equipment appearing in the channel ( e.g.
>> http://www.supermicro.com/products/system/1U/1028/SYS-1028U-TN10RT_.cfm
>> ).  The cost is significantly higher with commensurately higher
>> theoretical performance.  But if we're already not pushing our SSDs to
>> the max over SAS, the added benefit of NVMe would largely be lost.
>>
>> On the other hand, if we could safely upgrade to a more recent version
>> that is as stable and bulletproof as Firefly has been for us, but has
>> better performance with SSDs, that would not only benefit our current
>> setup, it would be a necessary first step for moving onto NVMe.
>>
>> So this raises three questions:
>>
>> 1) Have I correctly understood that one or more post-Firefly releases
>> exist that (c.p.) perform significantly better with all-SSD setups?
>>
>> 2) Is there any such release that (generally) is as rock-solid as
>> Firefly?  Of course this is somewhat situationally dependent, so I
>> would settle for: is there any such release that doesn't have any
>> known minding-my-own-business-suddenly-lost-data bugs in a 100% RBD
>> use case?
>>
>> 3) Has anyone done anything with NVMe as storage (not just journals)
>> who would care to share what kind of performance they experienced?
>>
>> (Of course if we do upgrade we will do so carefully, do a test cluster
>> first, have backups standing by, etc.  But if it's already known that
>> doing so will either not improve anything or is likely to blow up in
>> our faces, it would be better to leave well enough alone.  The current
>> performance is by no means bad, we're just always greedy for more. :)
>> )
>>
>> Thanks for any advice/suggestions!
>
> Hi David,
>
> The single biggest performance improvement we've seen for SSDs has resulted from the memory allocator investigation that Chaitanya Hulgol and Somnath Roy spearheaded at SanDisk, which others, including myself, have followed up on and tried to expand since then.
>
> See:
>
> http://www.spinics.net/lists/ceph-devel/msg25823.html
> https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg23100.html
> http://www.spinics.net/lists/ceph-devel/msg21582.html
>
> I haven't tested Firefly, but there's a good chance that you may see a significant performance improvement simply by upgrading your systems to tcmalloc 2.4 and running the OSDs with 128MB of thread cache, or by using LD_PRELOAD to load jemalloc.  This isn't something we officially support in RHCS yet, but we'll likely be moving toward it for future releases based on the very positive results we are seeing.  The biggest thing to keep in mind is that this does increase per-OSD memory usage by several hundred MB, so the 3-4X IOPS increase does come with a cost.  On the plus side, it also reduces CPU usage, sometimes dramatically.  You may be able to offset the increased memory usage somewhat by disabling transparent huge pages (especially with jemalloc).
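
As a rough illustration of the two allocator options described above, the sketch below builds the environment for an OSD process. TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is the gperftools environment variable for the total thread cache size; the jemalloc library path and the helper name osd_environment are assumptions for illustration and will differ between distributions. The code only constructs the environment and does not start anything.

```python
import os

# 128MB of tcmalloc thread cache, as suggested in the message above.
THREAD_CACHE_BYTES = 128 * 1024 * 1024

# Hypothetical path: locate the real jemalloc library on your system.
JEMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.1"


def osd_environment(allocator, base_env=None):
    """Return a copy of the environment tuned for the chosen allocator.

    allocator: "tcmalloc" (enlarge the thread cache) or
               "jemalloc" (preload jemalloc in place of the default).
    base_env:  environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    if allocator == "tcmalloc":
        env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(THREAD_CACHE_BYTES)
    elif allocator == "jemalloc":
        # Note: this overwrites any LD_PRELOAD already present.
        env["LD_PRELOAD"] = JEMALLOC_PATH
    else:
        raise ValueError(f"unknown allocator: {allocator!r}")
    return env


if __name__ == "__main__":
    env = osd_environment("tcmalloc", base_env={})
    print(env)
```

The resulting dictionary could be passed as the env argument of subprocess.Popen when starting a test OSD by hand; how a packaged OSD service picks up its environment depends on the distribution and init system.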
>
> See:
>
> http://www.spinics.net/lists/ceph-devel/msg26483.html
>
> FWIW, between Sage's newstore work and recent work by Somnath Roy to optimize the write path, we may see further improvement, but neither of those is ready for production yet.
>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>