CRUSH map advice

On Mon, Aug 11, 2014 at 11:26 PM, John Morris <john at zultron.com> wrote:

> On 08/11/2014 08:26 PM, Craig Lewis wrote:
>
>> Your MON nodes are separate hardware from the OSD nodes, right?
>>
>
> Two nodes are OSD + MON, plus a separate MON node.
>
>
>  If so,
>> with replication=2, you should be able to shut down one of the two OSD
>> nodes, and everything will continue working.
>>
>
> IIUC, the third MON node is sufficient for a quorum if one of the OSD +
> MON nodes shuts down, is that right?
>

So yeah, if you lose any one node, you'll be fine.  Two of the three monitors
will still be up, and two out of three is a majority, so you keep quorum.


>
> Replication=2 is a little worrisome, since we've already seen two disks
> fail simultaneously just in the year the cluster has been running.  That
> statistically unlikely event is probably the first and last time I'll see
> it, but they say lightning can strike twice....


That's a low probability, given the number of disks you have.  I would've
taken that bet (with backups).  As the number of OSDs goes up, the
probability of multiple simultaneous failures goes up, and slowly becomes a
bad bet.
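
Back-of-envelope, with made-up numbers just to show the shape of it: with N
disks, an annualized failure rate of f, and a recovery window of t after the
first failure, the chance of losing a second disk before the first one has
re-replicated is very roughly

    N * f  *  (N - 1) * f * (t / 1 year)    per year

With, say, 5 disks, f = 5%, and a 24-hour recovery window, that works out to
5*0.05 * 4*0.05*(1/365), or about 0.014% per year.  The N*(N-1) factor is
what grows as you add OSDs, and a longer recovery window makes it worse,
which is why replication=2 gets riskier as the cluster grows.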



>
>
>  Since it's for
>> experimentation, I wouldn't deal with the extra hassle of replication=4
>> and custom CRUSH rules to make it work.  If you have your heart set on
>> that, it should be possible.  I'm no CRUSH expert though, so I can't say
>> for certain until I've actually done it.
>>
>> I'm a bit confused why your performance is horrible though.  I'm
>> assuming your HDDs are 7200 RPM.  With the SSD journals and
>> replication=3, you won't have a ton of IO, but you shouldn't have any
>> problem doing > 100 MB/s with 4 MB blocks.  Unless your SSDs are very
>> low quality, the HDDs should be your bottleneck.
>>
>
> The below setup is tomorrow's plan; today's reality is 3 OSDs on one node
> and 2 OSDs on another, crappy SSDs, 1Gb networks, pgs stuck unclean and no
> monitoring to pinpoint bottlenecks.  My work is cut out for me.  :)
>
> Thanks for the helpful reply.  I wish we could just add a third OSD node
> and have these issues just go away, but it's not in the budget ATM.
>
>
Ah, yeah, that explains the performance problems.  That said, crappy SSD
journals are still better than no SSD journals.  When I added SSD journals
to my existing cluster, I saw my write bandwidth go from 10 MBps/disk to
50 MBps/disk.  Average latency dropped a bit, and the variance in latency
dropped a lot.
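
(The reason co-located journals hurt so much, as far as I understand it: with
the journal on the same spindle, every client write hits the disk twice, once
for the journal and once for the data, plus the seeks between the two.  So a
disk that can stream 100 MB/s delivers well under 50 MB/s of client writes.
Moving the journal to an SSD gives the HDD back its full sequential
bandwidth.)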

Just adding more disks to your existing nodes would help performance,
assuming you have room to add them.
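
And for the record, on the replication=4 idea from earlier: with only two OSD
hosts you'd want the rule to put two copies on each host, so that losing a
host still leaves two copies.  This is an untested sketch from memory, so
treat the rule name, the ruleset number, and the assumption that your root
bucket is called "default" as placeholders:

    rule rep4_two_hosts {
            ruleset 2
            type replicated
            min_size 4
            max_size 4
            step take default
            step choose firstn 2 type host
            step choose firstn 2 type osd
            step emit
    }

If you go that route, the usual dance is roughly:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # add the rule above to crushmap.txt
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    ceph osd pool set <pool> crush_ruleset 2
    ceph osd pool set <pool> size 4

Whether that's worth the hassle over replication=2 or 3 on two hosts is
another question, as above.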