Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Redundancy, all PGs degraded, undersized, not scrubbed in time

I think we're deviating from the original thread quite a bit, and I would never argue that in a production environment with plenty of OSDs you should go for R=2 or K+1, so my example cluster, which happens to be 2+1, is a bit unlucky.

However, I'm interested in the following:

On 11/16/20 11:31 AM, Janne Johansson wrote:
> So while one could always say "one more drive is better than your
> amount", there are people losing data with repl=2 or K+1 because some
> otherwise-normal operation was in flight and _then_ a single surprise
> happens. So you can have a weird reboot, causing those PGs to need
> backfill later, and if one of the up-to-date hosts has any single
> surprise during the recovery, the cluster will lack some of the current
> data even if two disks were never down at the same time.
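
If I'm reading that right, here's the sequence as a minimal sketch (plain Python, nothing Ceph-specific; the host names and the event order are just illustrative assumptions): one host reboots and its copy goes stale, then a single surprise hits one of the still-current hosts before backfill finishes.

def surviving_current_copies(size, events):
    """Walk through events and return how many current copies are left.

    size:   replica count (2 or 3)
    events: list of ("reboot", host) or ("fail", host) tuples
    """
    hosts = {f"host{i}" for i in range(size)}
    alive = set(hosts)      # hosts that are up at all
    current = set(hosts)    # hosts holding an up-to-date copy

    for kind, host in events:
        if kind == "reboot":
            # The host comes back, but its copy is behind and needs backfill:
            # it is alive, yet holds no current copy until recovery finishes.
            current.discard(host)
        elif kind == "fail":
            alive.discard(host)
            current.discard(host)
    return len(current & alive)

# Weird reboot on host0, then a surprise on host1 during the backfill window.
events = [("reboot", "host0"), ("fail", "host1")]
print("size=2:", surviving_current_copies(2, events), "current copies left")
print("size=3:", surviving_current_copies(3, events), "current copies left")

With size=2 that ends at zero current copies (the newest writes are gone) even though two disks were never down at the same time; with size=3 one current copy survives the same two events.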

I'm not sure I follow: from a logical perspective they *are* down at the same time, right? In your scenario one up-to-date replica was left, but even that had a surprise. Okay, that's the risk you take with R=2, but it's not intrinsically different from R=3.
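
To put rough numbers on "not intrinsically different": the mechanism is the same (lose every current copy while one replica is stale), only the number of additional surprises required changes. A back-of-envelope sketch with a made-up per-host failure probability for the backfill window (nothing measured on a real cluster):

# p: hypothetical probability that a given up-to-date host has a "surprise"
# during the backfill window opened by the first, benign event.
p = 0.01

loss_r2 = p        # R=2: one current copy left; one more surprise loses data
loss_r3 = p ** 2   # R=3: two current copies left; both must fail in the window

print(f"R=2 loss probability during the window: {loss_r2:.4%}")
print(f"R=3 loss probability during the window: {loss_r3:.4%}")

So the failure mode is logically the same either way; R=3 just buys you roughly another factor of p before it bites.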