Hello,

actually, replying in the other thread was fine by me, it was after all relevant to it in a sense. And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there. So am I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases. I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.

Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

> I've a question regarding advice from these threads:
> https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
>
> Our current setup has 4 osd's per node. When a drive fails the
> cluster is almost unusable for data entry. I want to change our set up
> so that under no circumstances ever happens.
>
While you can pretty much prevent this from happening, your cluster should still be able to handle a recovery. Ceph is a bit more ham-fisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, but you would see something similar with DRBD and badly configured recovery speeds.

In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things, because at some point a recovery might be unavoidable.

To prevent recoveries caused by failed disks, use RAID; for node failures you could permanently set OSD noout, or have your monitoring software do that when it detects a node failure.
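In practice that boils down to a handful of settings and one command. Take this as a rough starting point on my part, not gospel; the actual values need tuning for your hardware and workload:

    # ceph.conf, [osd] section: keep recovery/backfill from starving client I/O
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # Set before planned maintenance, or from your monitoring when a node dies,
    # so the cluster stays degraded instead of re-replicating everything:
    ceph osd set noout
    # ... repair / reboot the node ...
    ceph osd unset noout

Just remember to unset noout once the node is back, otherwise a genuinely dead OSD will never be marked out and its data never re-replicated.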
> Network: we use 2 IB switches and bonding in fail over mode.
> Systems are two Dell Poweredge r720 and Supermicro X8DT3.
>
I'm confused. Those Dells tend to have 8 drive bays normally, don't they? So you're just using 4 HDDs for OSDs? No SSD journals? Just 2 storage nodes?

Note that unless you do use RAIDed OSDs, this leaves you vulnerable to dual disk failures. Which will happen.

Also, that SM product number is for a motherboard, not a server; is that your monitor host? Anything production with data on it that you value should have 3 mon hosts. If you can't afford dedicated ones, putting them on an OSD node (preferably with the OS on SSDs to keep leveldb happy) is better than having just one, because if that one dies or gets corrupted, your data is inaccessible.

> So looking at how to do things better we will try '#4- anti-cephalopod'.
>
That is a seriously funny phrase!

> We'll switch to using raid-10 or raid-6 and have one osd per node, using
> high end raid controllers, hot spares etc.
>
Are you still talking about the same hardware as above, just 4 HDDs for storage? With 4 HDDs I'd go for RAID10 (you definitely want a hot spare there); if you have more bays, use up to 12 drives in RAID6 with a high-performance controller that has a large HW cache.

> And use one Intel 200gb S3700 per node for journal
>
At 365MB/s write speed that's barely enough for 4 individual HDDs, but it will do nicely if those are in a RAID10 (which writes at half the aggregate speed of the individual drives). Keep in mind that your node will never be able to write faster than the speed of its journal.

> My questions:
>
> is there a minimum number of OSD's which should be used?
>
If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2 nodes are sufficient to begin with. However, your performance might not be what you expect (an OSD process seems to be incapable of doing more than about 800 write IOPS). With a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's not much of an issue; in my case, with an 11-disk RAID6 AND the 4GB HW cache Areca controller, it certainly is rather frustrating.

In short, the more nodes (OSDs) you can deploy, the better the performance will be. And of course, should a node die and you don't think it can be brought back within a sensibly short time frame, having more than 2 nodes will enable you to rebalance and restore your redundancy to the desired level.

> should OSD's per node be the same?
>
It is advantageous to have identical disks and OSD sizes; it makes the whole thing more predictable and you don't have to play with weights.

As for having different numbers of OSDs per node, consider this example: 4 nodes with 1 OSD each and one node with 4 OSDs (all OSDs of the same size). What will happen here is that all the replicas from the single-OSD nodes might wind up on the 4-OSD node, so it had better have more power in all aspects than the single-OSD nodes. Now that node fails and you decide to let things rebalance because it can't be repaired quickly. But if your cluster was half full before, it will now be 100% full and become unusable (for writes).

So the moral of the story: deploy as much identical HW as possible.

Christian

> best regards, Rob
>
> PS: I had asked above in middle of another thread... please ignore
> there.

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/