Hi,

I'm very interested in this calculation. What assumptions have you made? Network speed, OSD fill level, etc.? Thanks

Wolfgang

On 12/07/2016 11:16 AM, Дмитрий Глушенок wrote:
> Hi,
>
> Let me add a little math to your warning: with an LSE rate of 1 in 10^15 on
> modern 8 TB disks there is a 5.8% chance of hitting an LSE during recovery
> of an 8 TB disk. So roughly every 18th recovery will fail. As with RAID6
> (two parity disks), size=3 mitigates the problem.
> By the way, why is it a common opinion that using RAID (RAID6) with
> Ceph (size=2) is a bad idea? It is cheaper than size=3, all hardware disk
> errors are handled by the RAID controller (instead of OS/Ceph), it decreases
> the OSD count, adds some battery-backed cache and increases the performance
> of a single OSD.
>
>> On Dec 7, 2016, at 11:08, Wido den Hollander <wido@xxxxxxxx
>> <mailto:wido@xxxxxxxx>> wrote:
>>
>> Hi,
>>
>> As a Ceph consultant I get numerous calls throughout the year to help
>> people get their broken Ceph clusters back online.
>>
>> The causes of downtime vary vastly, but one of the biggest is that
>> people use 2x replication: size = 2, min_size = 1.
>>
>> In 2016 the number of cases where data was lost due to these settings
>> grew exponentially.
>>
>> Usually a disk fails, recovery kicks in, and while recovery is
>> happening a second disk fails, causing PGs to become incomplete.
>>
>> There have been too many times where I had to use xfs_repair on broken
>> disks and use ceph-objectstore-tool to export/import PGs.
>>
>> I really don't like these cases, mainly because they can be prevented
>> easily by using size = 3 and min_size = 2 for all pools.
>>
>> With size = 2 you are in the danger zone as soon as a single
>> disk/daemon fails. With size = 3 you always have two additional copies
>> left, thus keeping your data safe(r).
>>
>> If you are running CephFS, at least consider running the 'metadata'
>> pool with size = 3 to keep the MDS happy.
>>
>> Please, let this be a big warning to everybody who is running with
>> size = 2. The downtime and problems caused by missing objects/replicas
>> are usually big, and it takes days to recover from them. Very often
>> data is lost and/or corrupted, which causes even more problems.
>>
>> I can't stress this enough: running with size = 2 in production is a
>> SERIOUS hazard and should not be done, imho.
>>
>> To anyone out there running with size = 2, please reconsider!
>>
>> Thanks,
>>
>> Wido
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Dmitry Glushenok
> Jet Infosystems
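As a partial answer to the question about assumptions, the quoted LSE figure can be roughly reproduced with a short sketch. This is my own reconstruction, not the original poster's math; it assumes 8 TB = 8×10^12 bytes, one full read of the disk during recovery, and independent latent sector errors at a rate of one per 10^15 bits read:

```python
import math

# Assumptions (mine, not stated in the thread): the whole disk is read
# once during recovery, and latent sector errors (LSEs) occur
# independently per bit at the vendor-quoted rate.
disk_bytes = 8e12            # 8 TB disk
bits_read = disk_bytes * 8   # bits read to rebuild the disk
lse_rate = 1e-15             # LSEs per bit read

# P(at least one LSE) = 1 - (1 - rate)^bits.
# Use log1p/expm1 to avoid precision loss with such a tiny rate.
p_fail = -math.expm1(bits_read * math.log1p(-lse_rate))
print(f"P(LSE during recovery) ~ {p_fail:.1%}")  # ~6.2%
```

Under these assumptions the result is about 6.2%, in the same ballpark as the 5.8% quoted above; the gap presumably comes from a slightly different disk-size or error-rate convention.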