Re: why was osd pool default size changed from 2 to 3.

On 10/24/2015 09:41 AM, Stefan Eriksson wrote:
>> On 23.10.2015 at 20:53, Gregory Farnum wrote:
>>> On Fri, Oct 23, 2015 at 8:17 AM, Stefan Eriksson <stefan@xxxxxxxxxxx>
>>> wrote:
>>>
>>> Nothing changed to make two copies less secure. 3 copies is just so
>>> much more secure and is the number that all the companies providing
>>> support recommend, so we changed the default.
>>> (If you're using it for data you care about, you should really use 3
>>> copies!)
>>> -Greg
>>
>> I assume that number really depends on the number of OSDs you have in
>> your crush rule for that pool. A replication of 2 might be OK for a
>> pool spread over 10 OSDs, but not for one spread over 100 OSDs...
>>
>> Corin
>>
> 
> I'm also interested in this: what changes when you add 100+ OSDs (to
> warrant 3 replicas instead of 2), and what is the reasoning behind "the
> companies providing support recommend 3"?
> Theoretically it seems secure to have two replicas.
> If you have 100+ OSDs, I can see that maintenance will take much longer,
> and if you use "set noout" then only a single replica of each affected
> PG will be active while the other is under maintenance.
> But if you "crush reweight to 0" before the maintenance, this would not
> be an issue.
> Is this the main reason?
> 
> From what I can gather, even if you add new OSDs to the cluster and
> rebalancing kicks in, the pool still maintains its two replicas.
> 
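
For reference, the two maintenance approaches mentioned above would look
roughly like this (osd.12 is just an example id, and 3.64 stands in for
whatever CRUSH weight the OSD had before; check 'ceph osd tree'):

  # option 1: tell the cluster not to mark the OSD out (and rebalance)
  # while it is briefly down
  ceph osd set noout
  (do the maintenance, then)
  ceph osd unset noout

  # option 2: drain the OSD first so the pool keeps its full replica
  # count during the maintenance
  ceph osd crush reweight osd.12 0
  (wait for the data to move off, do the maintenance, then)
  ceph osd crush reweight osd.12 3.64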

No, the danger is that your only remaining replica dies during recovery.
I've seen this happen in two different clusters just last week.

In one cluster a 3TB drive died, and while it was recovering a second
3TB drive died, which caused some PGs to become 'undersized'. min_size
was set to 2.
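
If you want to move an existing pool to the recommended settings,
something along these lines should do it (assuming a pool named 'rbd';
substitute your own pool name):

  # keep 3 copies of every object
  ceph osd pool set rbd size 3
  # only serve I/O while at least 2 copies are available
  ceph osd pool set rbd min_size 2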

In another cluster a 1TB SSD died, and while we were recovering from that
failure another SSD failed, causing the same situation as described above.

IIRC the people at Cern even run with 4 replicas, since they don't
consider 3 safe enough.

2 replicas isn't safe, no matter how big or small the cluster is. With
disks becoming larger, recovery times will only grow, and in that window
you don't want to be running on a single replica.
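
The defaults that new pools pick up can also be pinned in ceph.conf, for
example under [global] (option names as in the Ceph docs; worth
double-checking against your version):

  [global]
      osd pool default size = 3
      osd pool default min size = 2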

> thanks.


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


