Hello,

I shall top-quote and summarize here.

Firstly, we have to consider that Ceph is deployed by people with a wide
variety of needs, budgets and, most of all, cluster sizes.

Wido has the pleasure (or is that nightmare? ^o^) of dealing with a
really huge cluster: thousands of OSDs and a correspondingly large number
of nodes (if memory serves me).
Many others, like me, have comparatively small clusters with decidedly
fewer than 10 storage nodes.

So the approach and philosophy are obviously going to differ quite a bit
at either end of this spectrum.

If you start large (dozens of nodes and hundreds of OSDs), where only a
small fraction of your data (10% or less) is in a failure domain (host
initially), then you can play fast and loose and save a lot of money by
designing your machines and infrastructure accordingly.
That means doing without things like redundant OS drives, PSUs, even
redundant network links on the host if the cluster is big enough.
In a cluster of sufficient size, a node failure and the resulting data
movement are just background noise.

OTOH with smaller clusters, you obviously want to avoid failures if at
all possible, since not only will the re-balancing be more painful, but
the resulting smaller cluster will also have less performance.
This is why my OSD nodes have all the redundancy bells and whistles there
are, simply because a cluster big enough to not need them would be both
vastly more expensive, despite cheaper individual node costs, and also
underutilized.

Of course, if you should grow to a certain point, maybe your next
generation of OSD nodes can be built on the cheap w/o compromising safe
operations.

No matter what size your cluster is though, setting
"mon_osd_down_out_subtree_limit" to an appropriate value (host for small
clusters) is a good way to avoid re-balancing storms when a node (or some
larger segment) goes down, given that recovering the failed part can be
significantly faster than moving tons of data around.
This of course implies 24/7 monitoring and access to the HW.
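To put rough numbers on the cluster-size argument above, here is a
minimal Python sketch. It assumes identical hosts and perfectly evenly
distributed data, which real clusters only approximate; the function name
and the sample host counts are purely illustrative and not anything Ceph
provides.

```python
def host_failure_impact(num_hosts: int) -> dict:
    """Rough impact of losing one host.

    Assumption: all hosts are identical and data is spread evenly,
    with the host as the failure domain.
    """
    if num_hosts < 2:
        raise ValueError("need at least two hosts to survive a host failure")
    # Share of the cluster's data that lived on the failed host.
    share_on_failed_host = 1.0 / num_hosts
    # That data is re-created on the survivors, so each one grows by
    # this factor relative to what it held before.
    extra_load_per_survivor = 1.0 / (num_hosts - 1)
    return {
        "share_on_failed_host": share_on_failed_host,
        "extra_load_per_survivor": extra_load_per_survivor,
    }


# Hypothetical cluster sizes, from "small" to "large".
for n in (5, 10, 40):
    impact = host_failure_impact(n)
    print(
        f"{n:>3} hosts: {impact['share_on_failed_host']:.1%} of data affected, "
        f"each survivor takes on {impact['extra_load_per_survivor']:.1%} more"
    )
```

Under these assumptions the "10% or less" threshold corresponds to 10 or
more hosts; below that, a single host failure moves a much larger share
of the data and loads the survivors noticeably.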
As for dedicated MONs, I usually try to have the primary MON (lowest IP)
on dedicated HW and to be sure that MONs residing on OSD nodes have fast
storage and enough CPU/RAM to be happy even if the OSDs go on full spin.
Which incidentally is why your shared MONs are likely a better fit for an
HDD based OSD node than an SSD based one used for a cache pool, for
example.

Anyway, MONs are clearly candidates for having their OS (where /var/lib
resides) on RAIDed, hot-swappable, fast, durable and power-loss safe
SSDs, just so you can avoid losing one and having to shut down the whole
thing in the (unlikely) case of an SSD failure.

Regards,

Christian

On Sat, 13 Aug 2016 09:43:26 +0200 wido@xxxxxxxx wrote:

>
>
> > On 13 Aug 2016, at 08:58, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
> >
> >
> >>> On 13 Aug 2016, at 03:19, Bill Sharer wrote:
> >>>
> >>> If all the system disk does is handle the o/s (ie osd journals are
> >>> on dedicated or osd drives as well), no problem. Just rebuild the
> >>> system and copy the ceph.conf back in when you re-install ceph.
> >>> Keep a spare copy of your original fstab to keep your osd filesystem
> >>> mounts straight.
> >>
> >> With systems deployed with ceph-disk/ceph-deploy you no longer need
> >> an fstab. Udev handles it.
> >>
> >>> Just keep in mind that you are down 11 osds while that system drive
> >>> gets rebuilt though. It's safer to do 10 osds and then have a
> >>> mirror set for the system disk.
> >>
> >> In the years that I have run Ceph I have rarely seen OS disks fail.
> >> Why bother? Ceph is designed for failure.
> >>
> >> I would not sacrifice an OSD slot for an OS disk. Also, let's say an
> >> additional OS disk is €100.
> >>
> >> If you put that disk in 20 machines that's €2,000. For that money
> >> you can even buy an additional chassis.
> >>
> >> No, I would run on a single OS disk. It fails? Let it fail. Re-install
> >> and you're good again.
> >>
> >> Ceph makes sure the data is safe.
> >>
> >
> > Wido,
> >
> > can you elaborate a little bit more on this? How does CEPH achieve
> > that? Is it by redundant MONs?
> >
>
> No, Ceph replicates over hosts by default. So you can lose a host and
> the other ones will have copies.
>
> >
> > To my understanding the OSD mapping is needed to have the cluster
> > back. In our setup (I assume in others as well) that is stored on the
> > OS disk. Furthermore, our MONs are running on the same host as OSDs.
> > So if the OS disk fails, not only do we lose the OSD host but we also
> > lose the MON node. Is there another way to be protected from such a
> > failure besides additional MONs?
> >
>
> Aha, MON on the OSD host. I never recommend that. Try to use dedicated
> machines with a good SSD for MONs.
>
> Technically you can run the MON on the OSD nodes, but I always try to
> avoid it. It just isn't practical when stuff really goes wrong.
>
> Wido
>
> > We recently had a problem where a user accidentally deleted a volume.
> > Of course this has nothing to do with OS disk failure itself, but it
> > prompted us to start looking for other possible failures on our
> > system that could jeopardize data, and this thread got my attention.
> >
> >
> > Warmest regards,
> >
> > George
> >
> >
> >> Wido
> >>
> >> Bill Sharer
> >>
> >>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
> >>>
> >>>> On 12.08.2016 13:41, Félix Barbeira wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm planning to make a ceph cluster but I have a serious doubt. At
> >>>> this moment we have ~10 servers DELL R730xd with 12x4TB SATA
> >>>> disks. The official ceph docs say:
> >>>>
> >>>> "We recommend using a dedicated drive for the operating system and
> >>>> software, and one drive for each Ceph OSD Daemon you run on the
> >>>> host."
> >>>>
> >>>> I could use for example 1 disk for the OS and 11 for OSD data. In
> >>>> the operating system I would run 11 daemons to control the OSDs.
> >>>> But...what happens to the cluster if the disk with the OS fails??
> >>>> maybe the cluster thinks that 11 OSDs failed and tries to replicate
> >>>> all that data over the cluster...that sounds no good.
> >>>>
> >>>> Should I use 2 disks for the OS making a RAID1? in this case I'm
> >>>> "wasting" 8TB only for ~10GB that the OS needs.
> >>>>
> >>>> All the docs that I've been reading say ceph has no
> >>>> single point of failure, so I think that this scenario must have an
> >>>> optimal solution, maybe somebody could help me.
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> --
> >>>>
> >>>> Félix Barbeira.
> >>> if you do not have dedicated slots on the back for OS disks, then i
> >>> would recommend using SATADOM flash modules directly into a SATA port
> >>> internal in the machine. Saves you 2 slots for osd's and they are
> >>> quite reliable. you could even use 2 sd cards if your machine has
> >>> the internal SD slot
> >>>
> >>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
> >>> [1]
> >>>
> >>> kind regards
> >>> Ronny Aasen
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@xxxxxxxxxxxxxx [2]
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]
> >>
> >>
> >> Links:
> >> ------
> >> [1] http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
> >> [2] mailto:ceph-users@xxxxxxxxxxxxxx
> >> [3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> [4] mailto:bsharer@xxxxxxxxxxxxxx

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/