Re: Worst thing that can happen if I have size= 2

I think the answer is very simple: data loss. You are setting yourself up for data loss. Having only +1 redundancy is a design flaw, and you will be fully responsible for losing data on such a set-up. If that is not a problem, then it's an option. If it will get you fired, it's not.

> There is a big difference between traditional RAID1 and Ceph. Namely, with
> Ceph, there are nodes where OSDs are running, and these nodes need
> maintenance. You want to be able to perform maintenance even if you have
> one broken OSD, that's why the recommendation is to have three copies with
> Ceph. There is no such "maintenance" consideration with traditional RAID1,
> so two copies are OK there.

Yes, this is exactly the point. The keyword is "redundancy under degraded conditions". It's not just that you want to be able to do maintenance while an OSD is down, you want to be able to do maintenance without risking data loss every single time. A simple example is OS updates that require reboots. Each of these operations opens a window of opportunity for something else to fail.

The other thing is admin errors. Redundancy under degraded conditions allows admins to make one or more extra mistakes during maintenance. I learned this the hard way when upgrading our MON data disks. We have 3 MONs and I needed to migrate each MON store to new storage. Of course I managed to install the new disks in one MON and wipe the MON store on another. Two hours of downtime. I will upgrade to 5 MONs as soon as possible.

More serious examples are Ceph upgrades. There are plenty of instances where these went wrong for one reason or another and people needed to redeploy entire OSD hosts. That is a long window of opportunity for data loss during a complete rebuild.

And never trust your boss when he says "we will replace everything long before MTBF". This is BS as soon as budgets get cut.

I think, however, another really important aspect is data safety. In a small cluster you might get away with thinking in typical RAID terms. A scale-out cluster, however, is defined by the property that multiple simultaneous disk failures will be observed regularly, where simultaneous means failures within the window of opportunity opened by degraded objects being present.
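
To make that concrete, here is a rough back-of-the-envelope sketch of mine (not from any Ceph documentation) of how quickly overlapping failures become routine as the disk count grows. The 2% AFR, the 24-hour recovery window and the independence assumption are purely illustrative:

    # Toy model: chance that at least one *other* disk fails while the
    # cluster is still rebuilding from a first failure.
    HOURS_PER_YEAR = 24 * 365
    afr = 0.02                    # assumed annualized failure rate per disk
    window_h = 24.0               # assumed recovery/degraded window in hours
    p_window = afr * window_h / HOURS_PER_YEAR  # per-disk failure prob. in that window

    for n_disks in (12, 100, 1000, 5000):
        p_overlap = 1.0 - (1.0 - p_window) ** (n_disks - 1)
        print(f"{n_disks:5d} disks: P(second failure in window) ~ {p_overlap:.2%}")

At a few thousand disks, every single rebuild window carries a noticeable chance of another failure landing inside it.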

The threshold for observing this is not as high as one might think. Pushing prices down means pushing hardware to its physical limits, and quality control will not catch everything. We got a batch of 8 disks that do not seem to be great. I have already had one fail (half a year in production) and others regularly show up with slow ops. It's not bad enough to get them replaced, so I have to deal with it. They are all in one host, so I can sleep, but it is quite likely that a few of them go while the cluster is rebuilding redundancy.

For scale-out storage the distributed RAID of Ceph comes to the rescue; without it, running a scale-out system would be impossible. If you do the statistics on the probability of losing sufficiently many OSDs that share a PG, you will find that this probability goes down exponentially with the number of extra copies/shards, while +1 just leaves you at ordinary RAID level - meaning it's dangerous.
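
To illustrate that exponential drop with a toy model of my own (not anything Ceph itself computes): assume each OSD in a PG's acting set fails independently with a small probability p during the rebuild window, and that data is lost once more OSDs fail than the pool has extra copies/shards. The per-PG loss probability is then a binomial tail dominated by p^(extra+1):

    from math import comb

    def p_pg_loss(n_osds, extra, p):
        """Probability that more than `extra` of the PG's n_osds OSDs fail
        within one window (independent failures with probability p each)."""
        return sum(comb(n_osds, k) * p**k * (1 - p)**(n_osds - k)
                   for k in range(extra + 1, n_osds + 1))

    p = 0.001  # assumed per-OSD failure probability during the window
    for label, n, extra in [("size=2 (+1)", 2, 1), ("size=3 (+2)", 3, 2),
                            ("size=4 (+3)", 4, 3), ("EC 8+2 (+2)", 10, 2),
                            ("EC 8+3 (+3)", 11, 3)]:
        print(f"{label}: per-PG loss probability ~ {p_pg_loss(n, extra, p):.1e}")

Each extra copy/shard buys roughly another factor of p; +1 redundancy leaves you at p^2, which is exactly the classic RAID1 situation.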

Taking all of this together, maintainability and the probability of data loss, I regret that I didn't go for EC 8+3 (3 extra shards) instead of 8+2. The same holds for replication: 3 copies is the lowest number that is safe, but 4 is a lot better.
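
The space cost of that extra shard/copy is small compared to what it buys. A quick arithmetic sketch (plain k+m accounting, no Ceph-specific overhead included):

    # Raw-space overhead vs. number of failures tolerated, per profile.
    profiles = [("replica 3", 1, 2),   # (label, data units, redundancy units)
                ("replica 4", 1, 3),
                ("EC 8+2",    8, 2),
                ("EC 8+3",    8, 3)]
    for label, k, m in profiles:
        overhead = (k + m) / k         # raw bytes stored per usable byte
        print(f"{label:10s}: {overhead:.2f}x raw space, survives {m} failure(s)")

Going from 8+2 to 8+3 raises the overhead from 1.25x to about 1.38x, while going from 3 to 4 replicas costs a whole extra copy; in both cases you gain one more failure you can absorb while degraded.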

The bottom line is: data loss and ruined weekends/holidays are not worth going cheap. If I get a hardware alert at night, I want to be able to turn around and continue sleeping.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alexander E. Patrakov <patrakov@xxxxxxxxx>
Sent: 04 February 2021 11:35:27
To: Mario Giammarco
Cc: ceph-users
Subject:  Re: Worst thing that can happen if I have size= 2

There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.

Thu, 4 Feb 2021 at 00:49, Mario Giammarco <mgiammarco@xxxxxxxxx>:

> Thanks Simon, and thanks to the other people that have replied.
> Sorry, I will try to explain myself better.
> It is evident to me that if I have two copies of data, one breaks, and while
> Ceph is creating a new copy of the data the disk with the second
> copy also breaks, you lose the data.
> It is obvious and a bit paranoid, because many servers at many customers run
> on RAID1, and so you are saying: yes, you have two copies of the data but
> you can lose both. Consider that in Ceph recovery is automatic, while with RAID1
> someone must manually go to the customer and change disks. So Ceph is
> already an improvement in this case, even with size=2. With size 3 and min 2
> it is a bigger improvement, I know.
>
> What I am asking is this: what happens with min_size=1 and split brain, network
> down or similar things: does Ceph block writes because it has no quorum on the
> monitors? Are there some failure scenarios that I have not considered?
> Thanks again!
> Mario
>
>
>
> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <
> sironside@xxxxxxxxxxxxx> wrote:
>
> > On 03/02/2021 09:24, Mario Giammarco wrote:
> > > Hello,
> > > Imagine this situation:
> > > - 3 servers with ceph
> > > - a pool with size 2 min 1
> > >
> > > I know perfectly the size 3 and min 2 is better.
> > > I would like to know what is the worst thing that can happen:
> >
> > Hi Mario,
> >
> > This thread is worth a read, it's an oldie but a goodie:
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >
> > Especially this post, which helped me understand the importance of
> > min_size=2
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >
> > Cheers,
> > Simon
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


--
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



