> On 7 December 2016 at 10:06, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi Wido,
>
> Thanks for the warning. We have one pool as you described (size 2,
> min_size 1), simply because 3 replicas would be too expensive and
> erasure coding didn't meet our performance requirements. We are well
> aware of the risks, but of course this is a balancing act between risk
> and cost.

Well, that is good. You are aware of the risk.

> Anyway, I'm curious if you ever use
> osd_find_best_info_ignore_history_les in order to recover incomplete
> PGs (while accepting the possibility of data loss). I've used this on
> two colleagues' clusters over the past few months and as far as they
> could tell there was no detectable data loss in either case.

No, not really. Most of my cases were a true drive failure followed by a
second failure during recovery, with XFS broken underneath.

> So I run with size = 2 because if something bad happens I'll
> re-activate the PG with osd_find_best_info_ignore_history_les, then
> re-scrub both within Ceph and via our external application.
>
> Any thoughts on that?

No real thoughts, but it will mainly be useful in a flapping case where an
OSD might have outdated data; even outdated data is still better than
nothing at all.

Wido

> Cheers, Dan
>
> P.S. We're going to retry erasure coding for this cluster in 2017,
> because clearly 4+2 or similar would be much safer than size 2,
> provided we can get the needed performance.
>
> On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> > Hi,
> >
> > As a Ceph consultant I get numerous calls throughout the year to help
> > people get their broken Ceph clusters back online.
> >
> > The causes of downtime vary vastly, but one of the biggest is that
> > people use 2x replication: size = 2, min_size = 1.
> >
> > In 2016 the number of cases I had where data was lost due to these
> > settings grew exponentially.
> >
> > Usually a disk fails, recovery kicks in, and while recovery is still
> > happening a second disk fails, causing PGs to become incomplete.
> >
> > Too many times I have had to run xfs_repair on broken disks and use
> > ceph-objectstore-tool to export/import PGs.
> >
> > I really don't like these cases, mainly because they can easily be
> > prevented by using size = 3 and min_size = 2 for all pools.
> >
> > With size = 2 you enter the danger zone as soon as a single
> > disk/daemon fails. With size = 3 you always have two additional copies
> > left, which keeps your data safe(r).
> >
> > If you are running CephFS, at least consider running the 'metadata'
> > pool with size = 3 to keep the MDS happy.
> >
> > Please, let this be a big warning to everybody who is running with
> > size = 2. The downtime and problems caused by missing objects/replicas
> > are usually big, and it takes days to recover. But very often data is
> > lost and/or corrupted, which causes even more problems.
> >
> > I can't stress this enough: running with size = 2 in production is a
> > SERIOUS hazard and should not be done, imho.
> >
> > To anyone out there running with size = 2, please reconsider!
> >
> > Thanks,
> >
> > Wido
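
For concreteness, Wido's recommendation maps onto a handful of per-pool
settings. A minimal sketch, assuming a replicated pool named "rbd" and a
CephFS metadata pool named "cephfs_metadata" (both names are placeholders;
substitute your own pools):

    # Inspect the current replication settings of a pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # Keep 3 copies, and pause I/O on a PG when fewer than 2 copies remain.
    # Raising size triggers backfill to create the extra replicas, so expect
    # recovery traffic until the cluster converges.
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2

    # The CephFS metadata pool is tiny compared to the data pool, so 3x
    # replication costs almost nothing here
    ceph osd pool set cephfs_metadata size 3
    ceph osd pool set cephfs_metadata min_size 2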
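
The osd_find_best_info_ignore_history_les procedure Dan describes is a
last-resort peering override that can silently discard recent writes, so the
following is only a sketch of the usual sequence; the PG id (1.2f) and OSD id
(osd.5) are invented for illustration:

    # Identify the incomplete PGs and which OSD is primary for them
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg 1.2f query    # inspect "recovery_state" for what blocks peering

    # In ceph.conf on the node running the PG's primary OSD, add under [osd]:
    #     osd find best info ignore history les = true
    # then restart that OSD so the option is honored during peering
    systemctl restart ceph-osd@5

    # Once the PG goes active, remove the option again, restart the OSD,
    # and verify the objects, e.g. with a deep scrub
    ceph pg deep-scrub 1.2f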
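
The ceph-objectstore-tool export/import Wido mentions looks roughly like the
following. The paths, PG id, and OSD ids are illustrative, the OSD must be
stopped while the tool runs against its store, and filestore setups may also
need --journal-path:

    # Export a PG from a stopped (possibly failing) OSD
    systemctl stop ceph-osd@5
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
        --pgid 1.2f --op export --file /root/pg.1.2f.export

    # Import it into another stopped OSD that should hold the PG, then
    # start that OSD again
    systemctl stop ceph-osd@7
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
        --op import --file /root/pg.1.2f.export
    systemctl start ceph-osd@7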
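
As for the 4+2 erasure coding mentioned in Dan's P.S., the profile and pool
creation would look something like this. The profile name, pool name, and PG
count are placeholders, and the failure-domain key is spelled
ruleset-failure-domain on Jewel-era releases (crush-failure-domain from
Luminous on):

    # 4 data chunks + 2 coding chunks: survives two simultaneous failures,
    # like size = 3, but with 1.5x raw-space overhead instead of 3x
    ceph osd erasure-code-profile set ec42 k=4 m=2 ruleset-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec42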