Re: what happen to the OSDs if the OS disk dies?

Félix Barbeira <fbarbeira@xxxxxxxxx> · Tue, 16 Aug 2016 10:43:15 +0200

Thanks everybody for the answers, they really helped me a lot. So, to sum up, these are the options that I have:

1. OS on a RAID1.
   PROS: the cluster is protected against OS disk failures. If one of these disks fails, it can be replaced easily because it is hot-swappable.
   CONS: we are "wasting" 2 disk bays that could be used for OSDs.
   * In the case of the R730xd we have the option to put 2x2.5" SSDs in the slots on the back, like Brian says. For me this is clearly the best option. We'll see if the finance department has the same opinion :)
2. OS on a single disk.
   PROS: we are using only 1 disk slot. It could be a cheaper disk than the 4TB model because we are only going to use ~10GB.
   CONS: the OS is not protected against failures, and if this disk fails, all 11 OSDs in this machine fail too. In this case we might try to adjust the configuration so that the data on these OSDs is not rebuilt elsewhere while we wait for the OS disk to be replaced (I'm not sure if this is possible, I should check the docs).
3. OS on a SATADOM ( http://www.innodisk.com/intel/product.html )
   PROS: we have all the disk slots available for OSDs.
   CONS: I have no experience with this kind of device and I'm not sure if they are trustworthy. These devices are fast but they are not RAID protected, so it's a single point of failure like the previous option.
4. OS booted from a SAN (this is the option I'm considering for the non-R730xd machines, which do not have the 2x2.5" slots on the back).
   PROS: all the disk slots are available for OSDs. The OS disk is protected by RAID on the remote storage.
   CONS: we depend on the network. I guess the OS device does not generate much traffic, and all the Ceph OSD network traffic should go through a separate network card.

Maybe I'm missing some other option; if so, please tell me, it would be helpful.
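
To put rough numbers on the bay trade-off in the options above, here is a small Python sketch. It assumes 12 front bays and 4 TB data disks as in our R730xd servers. The option labels and the raw_capacity helper are made up for illustration, and the figures are raw capacity per node, before any Ceph replication.

```python
# Assumed figures from this thread: 12 front bays, 4 TB per data disk.
BAYS = 12
DISK_TB = 4

# Front bays consumed by the OS under each option (labels are illustrative).
OS_BAYS = {
    "RAID1 in front bays": 2,
    "single OS disk": 1,
    "SATADOM / rear SSDs / SAN boot": 0,
}


def raw_capacity(os_bays, bays=BAYS, disk_tb=DISK_TB):
    """Return (number of OSDs, raw TB) left after reserving bays for the OS."""
    osds = bays - os_bays
    return osds, osds * disk_tb


for option, used in OS_BAYS.items():
    osds, tb = raw_capacity(used)
    print(f"{option}: {osds} OSDs, {tb} TB raw per node")
```

So the RAID1-in-front-bays option costs 8 TB raw per node compared with keeping every bay for OSDs, and the single OS disk costs 4 TB.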

If anybody has experience booting the OS from a SAN, it would be really helpful to hear the pros and cons, because that option is very interesting to me.

2016-08-14 14:57 GMT+02:00 Christian Balzer <chibi@xxxxxxx>:

Hello,

I shall top-quote, summarize here.

Firstly we have to consider that Ceph is deployed by people with a wide
variety of needs, budgets and most of all cluster sizes.

Wido has the pleasure (or is that nightmare? ^o^) to deal with a really
huge cluster, thousands of OSDs and an accordingly large number of nodes (if
memory serves me).

Many others, like me, have comparatively small clusters with decidedly
fewer than 10 storage nodes.

So the approach and philosophy are obviously going to differ quite a bit
on either end of this spectrum.

If you start large (dozens of nodes and hundreds of OSDs), where only a
small fraction of your data (10% or less) is in a failure domain (host
initially), then you can play fast and loose and save a lot of money by
designing your machines and infrastructure accordingly.

That means going without things like redundant OS drives, PSUs, even
network links on the host if the cluster is big enough.

In a cluster of sufficient size, a node failure and the resulting data
movement are just background noise.

OTOH with smaller clusters, you obviously want to avoid failures if at all
possible, since not only is the re-balancing going to be more painful, but
the resulting smaller cluster will also have less performance.

This is why my OSD nodes have all the redundancy bells and whistles there
are, simply because a cluster big enough to not need them would be both
vastly more expensive, despite cheaper individual node costs, and also
underutilized.

Of course if you should grow to a certain point, maybe your next
generation of OSD nodes can be built on the cheap w/o compromising safe
operations.

No matter what size your cluster is though, setting
"mon_osd_down_out_subtree_limit" to an appropriate value (host for small
clusters) is a good way to avoid re-balancing storms when a node (or some
larger segment) goes down, given that recovering the failed part can be
significantly faster than moving tons of data around.

This of course implies 24/7 monitoring and access to the HW.
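
As a minimal sketch of what that setting could look like, the Python snippet below renders a ceph.conf-style fragment with the option set to "host". The render_subtree_limit helper is hypothetical, and placing the option in the [global] section is my assumption, so check the documentation for your release before copying anything.

```python
import configparser
import io


def render_subtree_limit(limit="host"):
    """Render a ceph.conf-style fragment that sets the subtree limit."""
    conf = configparser.ConfigParser()
    # Section placement is an assumption; the option name is from this thread.
    conf["global"] = {"mon_osd_down_out_subtree_limit": limit}
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


print(render_subtree_limit())
```

With the limit at "host", the OSDs of a host that goes down as a whole are not marked out automatically, so no re-balancing starts while the node is being repaired.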

As for dedicated MONs, I usually try to have the primary MON (lowest IP)
on dedicated HW and to be sure that MONs residing on OSD nodes have fast
storage and enough CPU/RAM to be happy even if the OSDs go on full spin.

Which incidentally is why your shared MONs are likely a better fit for a
HDD based OSD node than a SSD based one used for a cache pool for example.

Anyway, MONs are clearly candidates for having their OS (where /var/lib
resides) on RAIDed, hot-swappable, fast, durable and power-loss safe
SSDs, just so you can avoid losing one and having to shut down the whole
thing in the (unlikely) case of a SSD failure.

Regards,

Christian

On Sat, 13 Aug 2016 09:43:26 +0200 wido@xxxxxxxx wrote:

> > On 13 Aug 2016, at 08:58, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
> >
> >>> On 13 Aug 2016, at 03:19, Bill Sharer wrote:
> >>>
> >>> If all the system disk does is handle the o/s (ie osd journals are
> >>> on dedicated or osd drives as well), no problem. Just rebuild the
> >>> system and copy the ceph.conf back in when you re-install ceph.
> >>> Keep a spare copy of your original fstab to keep your osd filesystem
> >>> mounts straight.
> >>
> >> With systems deployed with ceph-disk/ceph-deploy you no longer need a
> >> fstab. Udev handles it.
> >>
> >>> Just keep in mind that you are down 11 osds while that system drive
> >>> gets rebuilt though. It's safer to do 10 osds and then have a
> >>> mirror set for the system disk.
> >>
> >> In the years that I have run Ceph I have rarely seen OS disks fail.
> >> Why bother? Ceph is designed for failure.
> >>
> >> I would not sacrifice an OSD slot for an OS disk. Also, let's say an
> >> additional OS disk is €100.
> >>
> >> If you put that disk in 20 machines that's €2,000. For that money
> >> you can even buy an additional chassis.
> >>
> >> No, I would run on a single OS disk. It fails? Let it fail. Re-install
> >> and you're good again.
> >>
> >> Ceph makes sure the data is safe.
> >>
> >
> > Wido,
> >
> > can you elaborate a little bit more on this? How does Ceph achieve that? Is it by redundant MONs?
> >
>
> No, Ceph replicates over hosts by default. So you can lose a host and the other ones will have copies.
>
> > To my understanding the OSD mapping is needed to have the cluster back. In our setup (I assume in others as well) that is stored on the OS disk. Furthermore, our MONs are running on the same hosts as the OSDs. So if the OS disk fails, not only do we lose the OSD host but we also lose the MON node. Is there another way to be protected from such a failure besides additional MONs?
> >
>
> Aha, MON on the OSD host. I never recommend that. Try to use dedicated machines with a good SSD for MONs.
>
> Technically you can run the MON on the OSD nodes, but I always try to avoid it. It just isn't practical when stuff really goes wrong.
>
> Wido
>
> > We recently had a problem where a user accidentally deleted a volume. Of course this has nothing to do with OS disk failure itself, but it has prompted us to start looking for other possible failures on our system that could jeopardize data, and this thread got my attention.
> >
> > Warmest regards,
> >
> > George
> >
> >> Wido
> >>
> >> Bill Sharer
> >>
> >>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
> >>>
> >>>> On 12.08.2016 13:41, Félix Barbeira wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm planning to build a ceph cluster but I have a serious doubt. At
> >>>> this moment we have ~10 DELL R730xd servers with 12x4TB SATA
> >>>> disks. The official ceph docs say:
> >>>>
> >>>> "We recommend using a dedicated drive for the operating system and
> >>>> software, and one drive for each Ceph OSD Daemon you run on the
> >>>> host."
> >>>>
> >>>> I could use for example 1 disk for the OS and 11 for OSD data. In
> >>>> the operating system I would run 11 daemons to control the OSDs.
> >>>> But... what happens to the cluster if the disk with the OS fails?
> >>>> Maybe the cluster thinks that 11 OSDs failed and tries to replicate
> >>>> all that data over the cluster... that does not sound good.
> >>>>
> >>>> Should I use 2 disks for the OS in a RAID1? In this case I'm
> >>>> "wasting" 8TB only for the ~10GB that the OS needs.
> >>>>
> >>>> All the docs that I've been reading say ceph has no
> >>>> single point of failure, so I think that this scenario must have an
> >>>> optimal solution. Maybe somebody could help me.
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> --
> >>>> Félix Barbeira.
> >>>
> >>> If you do not have dedicated slots on the back for OS disks, then I
> >>> would recommend using SATADOM flash modules plugged directly into an
> >>> internal SATA port in the machine. That saves you 2 slots for OSDs,
> >>> and they are quite reliable. You could even use 2 SD cards if your
> >>> machine has the internal SD slot.
> >>>
> >>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf [1]
> >>>
> >>> kind regards
> >>> Ronny Aasen
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@xxxxxxxxxxxxxx [2]
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]
> >>
> >> Links:
> >> ------
> >> [1] http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
> >> [2] mailto:ceph-users@xxxxxxxxxx.com
> >> [3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> [4] mailto:bsharer@xxxxxxxxxxxxxx

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

-- 
Félix Barbeira.
