Re: CEPH with NVMe SSDs and Caching vs Journaling on SSDs

Hello,

On Mon, 20 Jun 2016 15:12:49 +0000 Tim Gipson wrote:

> Christian,
> 
> Thanks for all the info. I’ve been looking over the mailing lists.
> There is so much info there and from the looks of it, setting up a cache
> tier is much more complex than I had originally thought.  
> 
More complex, yes.
But depending on your use case also potentially very rewarding.
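
For reference, the core commands look roughly like this; "rbd" and
"cache" are placeholder pool names, the cache pool is assumed to already
sit on a CRUSH rule that only selects your NVMe OSDs, and the size/ratio
values below need tuning for your actual hardware:

  # put the NVMe pool "cache" in front of the HDD-backed pool "rbd"
  ceph osd tier add rbd cache
  ceph osd tier cache-mode cache writeback
  ceph osd tier set-overlay rbd cache
  # hit set tracking plus flush/evict targets for the cache pool
  ceph osd pool set cache hit_set_type bloom
  ceph osd pool set cache target_max_bytes 500000000000
  ceph osd pool set cache cache_target_dirty_ratio 0.4
  ceph osd pool set cache cache_target_full_ratio 0.8

The flush/evict targets in particular have to match the real size of
your cache pool, otherwise it can fill up and stall I/O.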

> Moving the journals to OSDs was much simpler for me because you can just
> use ceph-deploy and point the journal to the device you want.
> 
SSD journals for HDD OSDs is always a good first step.
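
With a pre-partitioned NVMe that is a one-liner per OSD, something along
these lines (host and device names are just examples):

  # one journal partition on the NVMe per HDD OSD
  ceph-deploy osd create osd-node1:/dev/sdb:/dev/nvme0n1p1
  ceph-deploy osd create osd-node1:/dev/sdc:/dev/nvme0n1p2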

> I do understand the difference between the cache tier and journaling.
> 
> As per your comment about the monitor nodes, the extra monitor nodes are
> for the purpose of resiliency.  We are trying to build our storage and
> compute clusters with lots of failure in mind.
> 
Since you have the HW already there's not much point in changing it, but
things would have been more resilient and performant with 1 dedicated
monitor node and 4 storage nodes also running MONs.

Note that more than 5 monitors is considered counterproductive in nearly
all cases.
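
In ceph.conf terms that layout would simply have been something like
this (host names and addresses made up):

  [global]
  # 1 dedicated MON plus MONs on 4 of the storage nodes = 5 total
  mon initial members = mon1, store1, store2, store3, store4
  mon host = 10.0.0.10,10.0.0.11,10.0.0.12,10.0.0.13,10.0.0.14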

> Our NVME drives are only the 800GB 3600 series.
> 
That's 2.4TB of writes per day (800GB at 3 DWPD), or a mere 28MB/s
(2.4TB / 86400s) sustained, when looking at the endurance of these NVMes.
That's not accounting for any write amplification (which should be
negligible with journals).

Wouldn't be an issue in my use case, but YMMV, so monitor the wear-out.
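
Checking the wear-out is easy enough with nvme-cli (the device name is
an example, a recent smartmontools will show the same counters):

  # media wear and total writes so far
  nvme smart-log /dev/nvme0 | egrep 'percentage_used|data_units_written'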

> As to our networking setup: The OSD nodes have 4 x 10G nics, a bonded
> pair for front end traffic and a bonded pair for cluster traffic.  The
> monitor nodes have a bonded pair of 1Gig nics.  Our clients have 4 x 10G
> nics as well with a bonded pair dedicated to storage front end traffic
> connected to the ceph cluster.
> 
Overkill, as your storage nodes are limited by the roughly 1GB/s journal
write speed of their P3600s.

In short, a split network in your case is a bit of a waste: reads only
ever use the public network, so they are potentially hampered by it while
the cluster links sit mostly idle.

Read the recent "Best Network Switches for Redundancy" thread for example.
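
For illustration, that split is just these two lines in ceph.conf
(subnets made up); dropping the cluster network and bonding all 4 ports
into the public network would let client reads use the full bandwidth:

  [global]
  public network = 10.10.10.0/24
  cluster network = 10.10.20.0/24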

> The single NVMe for journaling was a concern but as you mentioned
> before, a host is our failure domain at this point.
> 
And with that in mind, don't fill your OSDs more than 60%.
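
For example, with 12 OSDs per host and the host as failure domain, the
data of one failed OSD has to backfill onto the 11 remaining OSDs in the
same host, so 60% becomes roughly 60 x 12/11 = ~65% during recovery.
If you want the cluster to warn (and eventually block writes) earlier
than the 0.85/0.95 defaults, the ratios can be lowered in ceph.conf,
for example:

  [global]
  mon osd nearfull ratio = 0.66
  mon osd full ratio = 0.75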

Christian

> I did find your comments to another user about having to add multiple
> roots per node because their NVMe drives were on different nodes.  That
> is the case for our gear as well.
> 
> Also, my gear is already in house so I’ve got what I’ve got to work with
> at this point, for good or ill.
> 
> Tim Gipson
> 
> 
> On 6/16/16, 7:47 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
> 
> 
> Hello,
> 
> On Thu, 16 Jun 2016 15:31:13 +0000 Tim Gipson wrote:
> 
> > A few questions.
> > 
> > First, is there a good step by step to setting up a caching tier with
> > NVMe SSDs that are on separate hosts?  Is that even possible?
> > 
> Yes. And with a cluster of your size that's the way I'd do it.
> Larger clusters (a dozen plus nodes) are likely to be better suited with
> storage nodes that have shared HDD OSDs for slow storage and SSD OSDs for
> cache pools.
> 
> It would behoove you to scour this ML for the dozens of threads covering
> this and other aspects, like:
> "journal or cache tier on SSDs ?"
> "Steps for Adding Cache Tier"
> and even yesterday's:
> "Is Dynamic Cache tiering supported in Jewel"
> 
> > Second, what sort of performance are people seeing from caching
> > tiers/journaling on SSDs in Jewel?
> > 
> Not using Jewel, but it's bound to be better than Hammer.
> 
> Performance will depend on a myriad of things, including CPU, SSD/NVMe
> models, networking, tuning, etc.
> It would be better if you had a performance target and a budget to see if
> they can be matched up.
> 
> Cache tiering and journaling are very different things, don't mix them
> up.
> 
> > Right now I am working on trying to find best practice for a CEPH
> > cluster with 3 monitor nodes, and 3 OSD nodes, each with 1 800GB NVMe drive and 12
> > 6TB drives.
> > 
> No need for dedicated monitor nodes (definitely not 3 with a cluster of
> that size) if your storage nodes are designed correctly, see for example:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008879.html
> 
> > My goal is reliable/somewhat fast performance.
> >
> Well, for starters this cluster will give you the space of one of these
> nodes and worse performance than a single node due to the 3x replication.
> 
> What NVMe did you have in mind? A DC P3600 will give you 1GB/s writes
> (and 3DWPD endurance), a P3700 2GB/s (and 10DWPD endurance).
> 
> What about your network?
> 
> Since the default failure domain in Ceph is the host, a single NVMe as
> journal for all HDD OSDs isn't particularly risky, but it's something to
> keep in mind.
>  
> Christian
> > Any help would be greatly appreciated!
> > 
> > Tim Gipson
> > Systems Engineer
> > 
> > 
> > 
> > 618 Grassmere Park Drive, Suite 12
> > Nashville, TN 37211
> > 
> > 
> > 
> > website<http://www.ena.com/> | blog<http://www.ena.com/blog> |
> > support<http://support.ena.com/>
> > 
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



