Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage
requirements are not huge, but we want good keyboard response 99.99% of
the time. We mostly do data entry and reporting: 20-25 users doing order
and invoice processing plus email.

* DRBD has been very reliable, but I am the SPOF, meaning that when split
brain occurs [every 18-24 months] it is me or no one who knows what to
do. Try to explain how to deal with split brain in advance... (the
procedure I keep on file is in the P.S. below). For the future, Ceph
looks like it will be easier to maintain.

* We use Proxmox, so Ceph and the mons will share each node. I've used
Proxmox for a few years and like the KVM/OpenVZ management.

* Ceph hardware: four hosts with 8 drives each.
  - OS: RAID-1 on SSD.
  - OSDs: one four-disk RAID-10 array of 2TB drives per host. Two of the
    hosts will use Seagate Constellation ES.3 2TB 7200 RPM 128MB cache
    SAS 6Gb/s drives; the other two will use Western Digital RE
    WD2000FYYZ 2TB 7200 RPM 64MB cache SATA 6.0Gb/s drives.
  - Journal: 200GB Intel DC S3700.
  - Spare disk for the RAID.

* More questions. You wrote:

  "In essence, if your current setup can't handle the loss of a single
  disk, what happens if a node fails? You will need to design (HW) and
  configure (various Ceph options) your cluster to handle these things
  because at some point a recovery might be unavoidable. To prevent
  recoveries based on failed disks, use RAID, for node failures you could
  permanently set OSD noout or have a monitoring software do that when it
  detects a node failure."

  I'll research 'OSD noout'. Are there other settings I should read up on
  / consider? For node reboots due to kernel upgrades - how is that
  handled? Of course that would be scheduled for off hours. My current
  reading of the docs is sketched just below.
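If I understand the docs correctly, a planned kernel reboot boils down to
telling the monitors not to mark the node's OSDs "out" (and thus not to
trigger a rebalance) while it is down. A minimal sketch of what I have in
mind, assuming the standard ceph CLI - please correct me if this is wrong:

    # before taking the node down: don't auto-mark its OSDs "out"
    ceph osd set noout

    # ... reboot the node into the new kernel ...

    # once its OSDs have rejoined and PGs are active+clean again
    ceph osd unset noout

The other recovery-impact knobs I've found mentioned so far are these
[osd] settings in ceph.conf. The values here are just illustrative
low-impact settings, not tested recommendations:

    [osd]
    osd max backfills = 1          # concurrent backfills per OSD
    osd recovery max active = 1    # active recovery ops per OSD
    osd recovery op priority = 1   # deprioritize recovery vs client I/O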
Any other suggestions?

thanks for the suggestions,
Rob
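P.S. Since I mentioned split brain above: the manual recovery procedure I
keep on file boils down to picking a victim node whose unreplicated
changes get thrown away. This is the DRBD 8.4 syntax from the users guide
(8.3 spells the discard option differently), with "r0" as a placeholder
resource name:

    # on the split-brain victim - its unreplicated changes are discarded
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on the survivor, only if it has dropped to StandAlone
    drbdadm connect r0

After reconnecting, DRBD resyncs the victim from the survivor.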
On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer <chibi at gol.com> wrote:

> Hello,
>
> actually replying in the other thread was fine by me, it was after all
> relevant to it in a sense.
> And you mentioned something important there which you didn't mention
> below: that you're coming from DRBD with a lot of experience there.
>
> So do I, and Ceph/RBD simply isn't (and probably never will be) an
> adequate replacement for DRBD in some use cases.
> I certainly plan to keep deploying DRBD where it makes more sense
> (IOPS/speed), while migrating everything else to Ceph.
>
> Anyway, let's look at your mail:
>
> On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
>
> > I've a question regarding advice from these threads:
> > https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> >
> > Our current setup has 4 OSDs per node. When a drive fails, the
> > cluster is almost unusable for data entry. I want to change our setup
> > so that under no circumstances does that ever happen.
>
> While you can pretty much prevent this from happening, your cluster
> should still be able to handle a recovery.
> While Ceph is a bit more hamfisted than DRBD and definitely needs more
> controls and tuning to make recoveries have less of an impact, you would
> see something similar with DRBD and badly configured recovery speeds.
>
> In essence, if your current setup can't handle the loss of a single
> disk, what happens if a node fails?
> You will need to design (HW) and configure (various Ceph options) your
> cluster to handle these things, because at some point a recovery might
> be unavoidable.
> To prevent recoveries triggered by failed disks, use RAID; for node
> failures you could permanently set OSD noout, or have monitoring
> software do that when it detects a node failure.
>
> > Network: we use 2 IB switches and bonding in failover mode.
> > Systems are two Dell PowerEdge R720s and a Supermicro X8DT3.
>
> I'm confused. Those Dells tend to have 8 drive bays normally, don't
> they? So you're just using 4 HDDs for OSDs? No SSD journals?
> Just 2 storage nodes?
> Note that unless you use RAIDed OSDs, this leaves you vulnerable to dual
> disk failures. Which will happen.
>
> Also, that SM product number is for a motherboard, not a server; is that
> your monitor host?
> Anything production with data on it that you value should have 3 mon
> hosts. If you can't afford dedicated ones, sharing them with an OSD node
> (preferably with the OS on SSDs to keep leveldb happy) is better than
> just one, because if that one dies or gets corrupted, your data is
> inaccessible.
>
> > So looking at how to do things better, we will try '#4 -
> > anti-cephalopod'.
>
> That is a seriously funny phrase!
>
> > We'll switch to using raid-10 or raid-6 and have one osd per node,
> > using high-end raid controllers, hot spares etc.
>
> Are you still talking about the same hardware as above, just 4 HDDs for
> storage?
> With 4 HDDs I'd go for RAID10 (you definitely want a hot spare there);
> if you have more bays, use up to 12 for RAID6 with a high-performance,
> large-cache HW controller.
>
> > And use one Intel 200gb S3700 per node for journal
>
> That's barely enough for 4 HDDs at 365MB/s write speed, but it will do
> nicely if those are in a RAID10 (half the speed of the individual
> drives).
> Keep in mind that your node will never be able to write faster than the
> speed of its journal.
>
> > My questions:
> >
> > is there a minimum number of OSDs which should be used?
>
> If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2
> nodes is sufficient to begin with.
> However, your performance might not be what you expect (an OSD process
> seems to be incapable of doing more than 800 write IOPS).
> But with a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's
> not so much of an issue.
> In my case, with an 11-disk RAID6 AND a 4GB HW cache Areca controller,
> it certainly is rather frustrating.
>
> In short, the more nodes (OSDs) you can deploy, the better the
> performance will be. And of course, in case a node dies and you don't
> think it can be brought back in a sensibly short time frame, having more
> than 2 nodes will enable you to do a recovery/rebalance and restore your
> redundancy to the desired level.
>
> > should the number of OSDs per node be the same?
>
> It is advantageous to have identical disks and OSD sizes; it makes the
> whole thing more predictable and you don't have to play with weights.
>
> As for having different numbers of OSDs per node, consider this example:
>
> 4 nodes with 1 OSD, one node with 4 OSDs (all OSDs of the same size).
> What will happen here is that all the replicas from the single-OSD nodes
> might wind up on the 4-OSD node, so it had better have more power in all
> aspects than the single-OSD nodes.
> Now that node fails and you decide to let things rebalance, as it can't
> be repaired shortly. But your cluster was half full, and now it will be
> 100% full and become unusable (for writes).
>
> So the moral of the story: deploy as much identical HW as possible.
>
> Christian
>
> > best regards, Rob
> >
> > PS: I had asked the above in the middle of another thread... please
> > ignore it there.
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com          Global OnLine Japan/Fusion Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com