Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Redundancy, all PGs degraded, undersized, not scrubbed in time

>
> However, I'm interested in the following:
>
> On 11/16/20 11:31 AM, Janne Johansson wrote:
>  > So while one could always say "one more drive is better than your
>  > amount", there are people losing data with repl=2 or K+1 because some
>  > more normal operation was in flight and _then_ a single surprise
>  > happens. So you can have a weird reboot, causing those PGs to need
>  > backfill later, and if one of the up-to-date hosts has any single
>  > surprise during the recovery, the cluster will lack some of the
>  > current data even if two disks were never down at the same time.
>
> I'm not sure I follow; from a logical perspective they *are* down at the
> same time, right? In your scenario one up-to-date replica was left, but
> even that had a surprise. Okay, well, that's the risk you take with R=2,
> but it's not intrinsically different from R=3.
>

I was trying to describe something like this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013237.html
There are more posts from Ceph consultants who got called in after someone
running "only" R=2 or EC K+1 saw data loss, but I didn't dig them all up.

I.e., a kind of split-brain scenario where a small fault/outage on one of
the drives, followed later by a bigger fault on another, will hurt you in
R=2 or K+1 setups even though you never had two full faults at once: only
one drive that was temporarily out, and then the one "real" fault ("disk
died" or similar), which is the scenario we usually imagine RAID levels or
replication sizes are meant to handle.

I'm not trying to say you don't understand this, but rather that people who
run small Ceph clusters tend to start out with R=2 or K+1 EC because the
larger faults are the easier ones to imagine.

When you have R=3 and you move one of the three PG copies for a disk resize
or something similar, you are temporarily reduced to two copies (at least
two up-to-date copies if writes are happening during the move), so you can
still absorb one surprise before the move completes without losing data.
With R=2/EC K+1, not so much.
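
(If anyone reading this wants to move an existing pool from R=2 to R=3, the
knobs involved are the pool's size and min_size. A minimal sketch, assuming
a replicated pool that happens to be named "rbd" - substitute your own pool
name, and expect backfill traffic while the third copies are created:

  ceph osd pool get rbd size        # show the current replica count
  ceph osd pool set rbd size 3      # ask for three copies; Ceph backfills
  ceph osd pool set rbd min_size 2  # serve I/O on two copies, block on one

With size=3/min_size=2 a single surprise leaves you degraded but still
writable, instead of one bad event away from loss.)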

Also, by "small" and "large" faults I mean that there is a huge difference
between "a few PGs with issues" and "disk completely broken". But if those
PGs belong to a pool with disk images, then every image in the pool is
prone to errors: it is not "we lost 1% of the files and only those need
restoring", but rather that all the disk images end up with random holes
in them.

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


