Re: why was osd pool default size changed from 2 to 3.

TL;DR - Running two copies in my cluster cost me a weekend, and many more hours of productive time during normal working hours. Networking problems can be just as destructive as disk problems. I only run 2 copies on throwaway data.

So, I have personal experience of data loss when running only two copies. I had a networking problem in my Ceph cluster, and it took me a long time to track it down because it was intermittent: it caused the node with the faulty connection not only to get marked out by its peers, but also to incorrectly mark out other nodes. It was a mess that I made worse by trying to force recovery before I really knew what the problem was, since it was so elusive.

In the end, the cluster tried to do recovery on PGs that had become degraded, but because there were only two copies it had no way to tell which one was correct, and when I forced it to choose it often chose wrong. All of the data was VM images, so I ended up with small bits of random corruption across almost all my VMs. It took me about 40 hours of work over a weekend to get things recovered (onto spare desktop machines, since I still hadn't found the problem and didn't trust the cluster) and rebuilt so that people could work on Monday, and I was cleaning up little bits of leftover mess for weeks.

Once I finally found and repaired the problem, it was another several days' worth of work to get the cluster rebuilt and the VMs migrated back onto it. Never again will I run only two copies of anything I actually care about, regardless of the quality of the underlying disk hardware. In my case, the disks were fine all along.
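
If it saves anyone else the same weekend: these are roughly the knobs involved. "rbd" here is just a placeholder pool name, and min_size 2 is what I would pair with size 3.

    # for an existing pool (replace "rbd" with your pool name)
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2

    # defaults for pools created afterwards, in the [global] section of ceph.conf
    osd pool default size = 3
    osd pool default min size = 2

Bear in mind that raising size on a pool that already holds data kicks off backfill to create the extra copies, so expect a burst of recovery traffic.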

QH

On Sat, Oct 24, 2015 at 8:35 AM, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

There have been COUNTLESS discussions about Ceph reliability, fault
tolerance and so forth in this very ML.
Google is very much evil, but in this case it is your friend.

In those threads you will find several reliability calculators, some more
flawed than others, but ultimately you do not use a replica of 2 for
the same reasons people don't use RAID5 for anything valuable.

A replication of 2 MAY be fine with very reliable, fast and not too large
SSDs, but that's about it.
Spinning rust is never safe with just a single redundant copy.
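
Part of the problem is what a size 2 pool does while it is degraded: unless
you raise min_size yourself, the computed default works out to 1, so the pool
keeps accepting writes with a single surviving copy, which is exactly the
window where one more failure loses data. Roughly, with "rbd" again only an
example pool name and assuming the computed default of size - size/2:

    ceph osd pool get rbd min_size
    # -> min_size: 1 on a size 2 pool left at the computed default
    ceph osd pool set rbd min_size 2
    # safer, but on a size 2 pool any single OSD failure then blocks I/O
    # until the data is recovered or the OSD comes back

With 3 copies and min_size 2 you get both: writes always land on at least two
copies, and a single failure does not stop the pool.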

Christian

On Sat, 24 Oct 2015 09:41:35 +0200 Stefan Eriksson wrote:

> > On 23.10.2015 at 20:53, Gregory Farnum wrote:
> >> On Fri, Oct 23, 2015 at 8:17 AM, Stefan Eriksson <stefan@xxxxxxxxxxx>
> >> wrote:
> >>
> >> Nothing changed to make two copies less secure. 3 copies is just so
> >> much more secure and is the number that all the companies providing
> >> support recommend, so we changed the default.
> >> (If you're using it for data you care about, you should really use 3
> >> copies!)
> >> -Greg
> >
> > I assume that number really depends on the (number of) OSDs you have in
> > your crush rule for that pool. A replication of 2 might be ok for a pool
> > spread over 10 OSDs, but not for one spread over 100 OSDs...
> >
> > Corin
> >
>
> I'm also interested in this: what changes when you add 100+ OSDs (to
> warrant 3 replicas instead of 2), and what is the reasoning behind "the
> companies providing support recommend 3"?
> Theoretically it seems secure to have two replicas.
> If you have 100+ OSDs, I can see that maintenance will take much longer,
> and if you use "set noout" then each PG will be active with only a single
> replica while the other one is under maintenance.
> But if you "crush reweight to 0" before the maintenance, this would not
> be an issue.
> Is this the main reason?
>
> From what I can gather, even if you add new OSDs to the cluster and
> rebalancing kicks in, it still maintains its two replicas.
>
> thanks.
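
For what it's worth, the two maintenance approaches you mention look roughly
like this; osd.12 is just a placeholder id, and how you stop and start the
daemon depends on your distro and release:

    # option 1: stop the cluster from marking anything out, then do the work
    ceph osd set noout
    # ... stop the OSD daemon, do the maintenance, start it again ...
    ceph osd unset noout

    # option 2: drain the OSD first, so its PGs are re-replicated elsewhere
    # before you touch it (slower, but no window with fewer copies)
    ceph osd crush reweight osd.12 0
    # wait for backfill to finish, do the maintenance, then restore the weight

With option 1 and a size 2 pool you really are serving from a single copy for
the duration, which is the situation you describe; option 2 avoids that at
the cost of moving the data twice.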


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

