anti-cephalopod question

I have 3 hosts that I want to use to test a new setup.

Currently they have 3-4 OSDs each.

Could you suggest a fast way to remove all the OSDs?
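
What I have in mind per OSD is roughly the manual removal sequence from the
docs (untested on my side, and osd.0 is just a placeholder ID):

  ceph osd out 0                 # stop placing data on this OSD
  service ceph stop osd.0        # stop the daemon (sysvinit; adjust for your init system)
  ceph osd crush remove osd.0    # remove it from the CRUSH map
  ceph auth del osd.0            # delete its authentication key
  ceph osd rm 0                  # remove it from the cluster map

Repeating that for every OSD on the 3 hosts would work, but is there a shorter
way when the whole cluster is being torn down anyway?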




On Mon, Jul 28, 2014 at 3:49 AM, Christian Balzer <chibi at gol.com> wrote:

>
> Hello,
>
> On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
>
> > Hello Christian,
> >
> > Let me supply more info and answer some questions.
> >
> > * Our main concern is high availability, not speed.
> > Our storage requirements are not huge.
> > However, we want good keyboard response 99.99% of the time. We mostly do
> > data entry and reporting: 20-25 users doing mostly order and invoice
> > processing and email.
> >
> > * DRBD has been very reliable, but I am the SPOF. When split brain
> > occurs [every 18-24 months], I am the only one who knows what to do.
> > Try explaining how to deal with split brain in advance...
> > Going forward, Ceph looks like it will be easier to maintain.
> >
> The DRBD people would of course tell you to configure things in a way that
> a split brain can't happen. ^o^
>
> Note that given the right circumstances (too many OSDs down, MONs down)
> Ceph can wind up in a similar state.
>
> > * We use Proxmox . So ceph and mons will share each node. I've used
> > proxmox for a few years and like the kvm / openvz management.
> >
> I tried it some time ago, but at that time it was still stuck with 2.6.32
> due to OpenVZ and that wasn't acceptable to me for various reasons.
> I believe it still is.
>
> > * Ceph hardware:
> >
> > Four  hosts .  8 drives each.
> >
> > OS: RAID-1 on SSD.
> >
> Good, that should be sufficient for running MONs (you will want 3).
>
> > OSD: four-disk RAID-10 array using 2TB drives.
> >
> > Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
> > 128MB Cache SAS 6Gb/s
> >
> > the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB
> > Cache SATA 6.0Gb/s   drives.
> >
> > Journal: 200GB Intel DC S3700 Series
> >
> > Spare disk for raid.
> >
> > * more questions.
> > you wrote:
> > "In essence, if your current setup can't handle the loss of a single
> > disk, what happens if a node fails?
> > You will need to design (HW) and configure (various Ceph options) your
> > cluster to handle these things because at some point a recovery might be
> > unavoidable.
> >
> > To prevent recoveries based on failed disks, use RAID, for node failures
> > you could permanently set OSD noout or have a monitoring software do that
> > when it detects a node failure."
> >
> > I'll research 'OSD noout'.
> >
> You would probably be happy with the "mon osd downout subtree limit" option
> set to "host" as well.
> In that case you will need to manually trigger a rebuild (set that
> node/OSD to out) if you can't repair a failed node in a short time and
> want to keep your redundancy level.
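
(If I follow, that would be a one-line addition to ceph.conf on the mon hosts,
something like the below; untested on my end, and the section placement is my
guess:

  [mon]
  mon osd downout subtree limit = host

followed by a manual "ceph osd out <id>" for each OSD of a dead node once we
decide it won't come back soon. Is that the idea?)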
>
> > Are there other settings I should read up on / consider?
> >
> > For node reboots due to kernel upgrades -  how is that handled?   Of
> > course that would be scheduled for off hours.
> >
> Set noout before a planned downtime or live dangerously and assume it
> comes back within the timeout period (5 minutes IIRC).
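
(So for a planned kernel reboot I assume the rough sequence is:

  ceph osd set noout     # before the reboot: down OSDs won't be marked out
  # ... reboot the node, wait for its OSDs to come back up ...
  ceph osd unset noout   # afterwards: normal handling of failed disks resumes

That is just my reading of the above, not a tested procedure.)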
>
> > Any other suggestions?
> >
> Test your cluster extensively before going into production.
>
> Fill it with enough data to be close to what you're expecting and fail one
> node/OSD.
>
> See how bad things become, try to determine where any bottlenecks are with
> tools like atop.
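
(For the fill-and-fail test I was planning to drive load with rados bench,
roughly as below; the pool name and durations are only examples:

  rados bench -p rbd 600 write --no-cleanup   # write test data for 10 minutes, keep it
  rados bench -p rbd 600 seq                  # sequential reads of that data

and then stop one node's OSDs mid-run while watching client latency and atop
on the surviving nodes. Does that sound like a sensible way to exercise it?)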
>
> While you've done pretty much everything to prevent that scenario, using
> RAID10 against disk failures and keeping nodes from being set out by
> whatever means you choose ("mon osd downout subtree limit = host" seems to
> work, I just tested it), having a cluster that doesn't melt down when
> recovering, or at least knowing how bad things will be in such a scenario,
> helps a lot.
>
> Regards,
>
> Christian
>
> > thanks for the suggestions,
> > Rob
> >
> >
> > On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer <chibi at gol.com> wrote:
> >
> > >
> > > Hello,
> > >
> > > actually, replying in the other thread was fine by me, it was after all
> > > relevant to it in a sense.
> > > And you mentioned something important there, which you didn't mention
> > > below, that you're coming from DRBD with a lot of experience there.
> > >
> > > So do I and Ceph/RBD simply isn't (and probably never will be) an
> > > adequate replacement for DRBD in some use cases.
> > > I certainly plan to keep deploying DRBD where it makes more sense
> > > (IOPS/speed), while migrating everything else to Ceph.
> > >
> > > Anyway, let's look at your mail:
> > >
> > > On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
> > >
> > > > I've a question regarding advice from these threads:
> > > >
> > >
> > > > https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > > >
> > > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> > > >
> > > >
> > > >
> > > > Our current setup has 4 OSDs per node. When a drive fails, the
> > > > cluster is almost unusable for data entry. I want to change our
> > > > setup so that this never happens under any circumstances.
> > > >
> > >
> > > While you can pretty much prevent this from happening, your cluster
> > > should still be able to handle a recovery.
> > > Ceph is a bit more hamfisted than DRBD and definitely needs more
> > > controls and tuning to soften the impact of recoveries, but you would
> > > see something similar with DRBD and badly configured recovery speeds.
> > >
> > > In essence, if your current setup can't handle the loss of a single
> > > disk, what happens if a node fails?
> > > You will need to design (HW) and configure (various Ceph options) your
> > > cluster to handle these things because at some point a recovery might
> > > be unavoidable.
> > >
> > > To prevent recoveries based on failed disks, use RAID, for node
> > > failures you could permanently set OSD noout or have a monitoring
> > > software do that when it detects a node failure.
> > >
> > > >  Network:  we use 2 IB switches and  bonding in fail over mode.
> > > >  Systems are two  Dell Poweredge r720 and Supermicro X8DT3 .
> > > >
> > >
> > > I'm confused. Those Dells tend to have 8 drive bays normally, don't
> > > they? So you're just using 4 HDDs for OSDs? No SSD journals?
> > > Just 2 storage nodes?
> > > Note that unless you do use RAIDed OSDs this leaves you vulnerable to
> > > dual disk failures. Which will happen.
> > >
> > > Also, that SM product number is for a motherboard, not a server; is that
> > > your monitor host?
> > > Anything in production with data on it that you value should have 3 mon
> > > hosts. If you can't afford dedicated ones, sharing them on an OSD node
> > > (preferably with the OS on SSDs to keep leveldb happy) is better than
> > > just one, because if that one dies or gets corrupted, your data is
> > > inaccessible.
> > >
> > > >  So looking at how to do things better we will try  '#4-
> > > > anti-cephalopod' .   That is a seriously funny phrase!
> > > >
> > > > We'll switch to using RAID-10 or RAID-6 and have one OSD per node,
> > > > using high-end RAID controllers, hot spares, etc.
> > > >
> > > Are you still talking about the same hardware as above, just 4 HDDs for
> > > storage?
> > > With 4 HDDs I'd go for RAID10 (definitely want a hot spare there); if
> > > you have more bays, use up to 12 for RAID6 with a high-performance
> > > controller with a large HW cache.
> > >
> > > > And use one Intel 200GB S3700 per node for the journal.
> > > >
> > > That's barely enough for 4 HDDs at 365MB/s write speed, but will do
> > > nicely if those are in a RAID10 (half speed of individual drives).
> > > Keep in mind that your node will never be able to write faster than the
> > > speed of your journal.
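
(Rough arithmetic as I understand it: four 7200 RPM drives at roughly
100-150MB/s each add up to 400-600MB/s, which would outrun a single S3700's
~365MB/s; but as a 4-disk RAID10 they take client writes at about the speed
of two drives, say 200-300MB/s, so one S3700 journal should keep up. Please
correct me if those per-disk figures are off.)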
> > >
> > > > My questions:
> > > >
> > > > is there a minimum number of OSD's which should be used?
> > > >
> > > If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2
> > > nodes is sufficient to begin with.
> > > However your performance might not be what you expect (an OSD process
> > > seems to be incapable of doing more than 800 write IOPS).
> > > But with a 4 disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's
> > > not so much of an issue.
> > > In my case with a 11 disk RAID6 AND the 4GB HW cache Areca controller
> > > it certainly is rather frustrating.
> > >
> > > In short, the more nodes (OSDs) you can deploy, the better the
> > > performance will be. And of course in case a node dies and you don't
> > > think it can be brought back in a sensible short time frame, having
> > > more than 2 nodes will enable you to do a recovery/rebalance and
> > > restore your redundancy to the desired level.
> > >
> > > > should  OSD's per node be the same?
> > > >
> > > It is advantageous to have identical disks and OSD sizes, makes the
> > > whole thing more predictable and you don't have to play with weights.
> > >
> > > As for having different number of OSDs per node, consider this example:
> > >
> > > 4 nodes with 1 OSD, one node with 4 OSDs (all OSDs are of the same
> > > size). What will happen here is that all the replicas from single OSD
> > > nodes might wind up on the 4 OSD node. So it better have more power in
> > > all aspects than the single OSD nodes.
> > > Now that node fails and you decide to let things rebalance, as it can't
> > > be repaired quickly. But your cluster was half full, and now it will be
> > > 100% full and become unusable (for writes).
> > >
> > > So the moral of the story, deploy as much identical HW as possible.
> > >
> > > Christian
> > >
> > > > best regards, Rob
> > > >
> > > >
> > > > PS: I had asked the above in the middle of another thread... please
> > > > ignore it there.
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi at gol.com           Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users at lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>