Re: 2x replication: A BIG warning

> On 7 December 2016 at 10:06, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> 
> 
> Hi Wido,
> 
> Thanks for the warning. We have one pool as you described (size 2,
> min_size 1), simply because 3 replicas would be too expensive and
> erasure coding didn't meet our performance requirements. We are well
> aware of the risks, but of course this is a balancing act between risk
> and cost.
> 

Well, that is good. You are aware of the risk.
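
For anyone else following along: checking and raising these values per pool is a one-liner each. A minimal example, with 'rbd' standing in for your pool name:

    # inspect the current replication settings
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # raise them to the recommended values; this will trigger backfill
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2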

> Anyway, I'm curious if you ever use
> osd_find_best_info_ignore_history_les in order to recover incomplete
> PGs (while accepting the possibility of data loss). I've used this on
> two colleagues' clusters over the past few months and as far as they
> could tell there was no detectable data loss in either case.
> 

No, not really. Most of my cases involved a genuine drive failure followed by a second one during recovery, with XFS broken underneath.

> So I run with size = 2 because if something bad happens I'll
> re-activate the PG with osd_find_best_info_ignore_history_les, then
> re-scrub both within Ceph and via our external application.
> 
> Any thoughts on that?
> 

No real thoughts, but it will mainly be useful in a flapping case where an OSD might have outdated data; that is still better than having nothing there at all.
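
For completeness, the rough recipe for that option (the OSD id and PG id below are placeholders; remove the setting again once the PG has peered):

    # find the incomplete PG and its primary OSD
    ceph health detail
    ceph pg 2.5 query

    # on that OSD's host, add to ceph.conf and restart the daemon:
    #   [osd.12]
    #   osd_find_best_info_ignore_history_les = true
    systemctl restart ceph-osd@12

    # once the PG goes active again, drop the option and restart the OSD once more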

Wido

> Cheers, Dan
> 
> P.S. we're going to retry erasure coding for this cluster in 2017,
> because clearly 4+2 or similar would be much safer than size 2,
> provided we can get the needed performance.
> 
> 
> 
> On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > Hi,
> >
> > As a Ceph consultant I get numerous calls throughout the year to help people get their broken Ceph clusters back online.
> >
> > The causes of downtime vary widely, but one of the biggest is that people use 2x replication: size = 2, min_size = 1.
> >
> > In 2016 the number of cases I handled where data was lost due to these settings grew exponentially.
> >
> > Usually a disk fails, recovery kicks in, and while recovery is happening a second disk fails, causing PGs to become incomplete.
> >
> > There have been too many times where I had to run xfs_repair on broken disks and use ceph-objectstore-tool to export/import PGs.
> >
> > I really don't like these cases, mainly because they can be prevented easily by using size = 3 and min_size = 2 for all pools.
> >
> > With size = 2 you enter the danger zone as soon as a single disk/daemon fails. With size = 3 a single failure still leaves you with two copies, keeping your data safe(r).
> >
> > If you are running CephFS, at least consider running the 'metadata' pool with size = 3 to keep the MDS happy.
> >
> > Please, let this be a big warning to everybody who is running with size = 2. The downtime and problems caused by missing objects/replicas are usually severe, and it can take days to recover from them. Very often data is also lost and/or corrupted, which causes even more problems.
> >
> > I can't stress this enough. Running with size = 2 in production is a SERIOUS hazard and should not be done imho.
> >
> > To anyone out there running with size = 2, please reconsider this!
> >
> > Thanks,
> >
> > Wido
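
P.S. On the 4+2 erasure coding plan: a profile like that is created roughly as follows (profile name, pool name and PG count are just examples, and the failure domain should match your setup):

    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 ruleset-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec-4-2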
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


