Re: 2x replication: A BIG warning

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Wed, 7 Dec 2016 10:06:00 +0100

Hi Wido,

Thanks for the warning. We have one pool as you described (size 2,
min_size 1), simply because 3 replicas would be too expensive and
erasure coding didn't meet our performance requirements. We are well
aware of the risks, but of course this is a balancing act between risk
and cost.

Anyway, I'm curious if you ever use
osd_find_best_info_ignore_history_les in order to recover incomplete
PGs (while accepting the possibility of data loss). I've used this on
two colleagues' clusters over the past few months and as far as they
could tell there was no detectable data loss in either case.

So I run with size = 2 because if something bad happens I'll
re-activate the PG with osd_find_best_info_ignore_history_les, then
re-scrub both within Ceph and via our external application.

Any thoughts on that?

Cheers, Dan

P.S. we're going to retry erasure coding for this cluster in 2017,
because clearly 4+2 or similar would be much safer than size 2,
provided we can get the needed performance.
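
For anyone weighing the same trade-off, here is a rough Python sketch of the arithmetic behind that P.S. The helper names are mine (hypothetical, not anything from Ceph). It counts only raw-space overhead and how many copies or chunks can be lost before data is gone. It ignores min_size, recovery time and performance, which are exactly the parts we still need to measure.

```python
from fractions import Fraction


def replicated(size):
    """Replicated pool: (raw/usable space ratio, copies that may be lost)."""
    # 'size' full copies are stored; data survives while one copy remains.
    return Fraction(size), size - 1


def erasure_coded(k, m):
    """Erasure-coded pool with k data and m coding chunks: same tuple."""
    # k + m chunks are stored for k chunks of data; any m chunks may be lost.
    return Fraction(k + m, k), m


if __name__ == "__main__":
    layouts = {
        "size = 2": replicated(2),
        "size = 3": replicated(3),
        "EC 4+2": erasure_coded(4, 2),
    }
    for name, (overhead, tolerated) in layouts.items():
        print(f"{name}: {float(overhead):.2f}x raw space, "
              f"tolerates {tolerated} lost OSD(s) per PG")
```

By this simplified count, 4+2 tolerates as many losses as size = 3 while using less raw space than size = 2, which is why it is attractive if the performance works out.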

On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> As a Ceph consultant I get numerous calls throughout the year to help people with getting their broken Ceph clusters back online.
>
> The causes of downtime vary widely, but one of the biggest is that people use 2x replication: size = 2, min_size = 1.
>
> In 2016 the number of cases I have seen where data was lost due to these settings grew exponentially.
>
> Usually a disk fails, recovery kicks in, and while recovery is happening a second disk fails, causing PGs to become incomplete.
>
> There have been too many times where I had to use xfs_repair on broken disks and use ceph-objectstore-tool to export/import PGs.
>
> I really don't like these cases, mainly because they can be prevented easily by using size = 3 and min_size = 2 for all pools.
>
> With size = 2 you go into the danger zone as soon as a single disk/daemon fails. With size = 3 you still have two copies left after a single failure, thus keeping your data safe(r).
>
> If you are running CephFS, at least consider running the 'metadata' pool with size = 3 to keep the MDS happy.
>
> Please, let this be a big warning to everybody who is running with size = 2. The downtime and problems caused by missing objects/replicas are usually big, and it takes days to recover from them. Worse, very often data is lost and/or corrupted, which causes even more problems.
>
> I can't stress this enough. Running with size = 2 in production is a SERIOUS hazard and should not be done imho.
>
> To anyone out there running with size = 2, please reconsider this!
>
> Thanks,
>
> Wido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com