Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Redundancy, all PGs degraded, undersized, not scrubbed in time

To throw in my 5 cents: choosing m in k+m EC replication is not arbitrary, and the argument that anyone with a larger m could always say a lower m is wrong does not hold either.

Why are people recommending m>=2 for production (or R>=3 replicas)?

It's very simple. What is forgotten below is maintenance. Whenever you do maintenance on ceph, there will be longer periods of degraded redundancy while OSDs are down. However, on production storage systems, writes *always* need to go to redundant storage. Hence, minimum redundancy under maintenance is the key point here.

With m=1 (R=2) one could never do any maintenance without downtime, because shutting down just one OSD would imply writes to non-redundant storage, which in turn would mean data loss if a disk dies during maintenance.

Basically, with m parity shards you can do maintenance on m-1 failure domains at the same time without downtime or non-redundant writes. With R copies you can do maintenance on R-2 failure domains without downtime.
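As a minimal sketch of that rule of thumb (my own illustration; the function names are made up and not anything from Ceph itself):

def ec_maintenance_headroom(k: int, m: int) -> int:
    # k+m EC tolerates m lost shards; keep one spare for surprises during maintenance.
    return max(m - 1, 0)

def replica_maintenance_headroom(r: int) -> int:
    # R copies tolerate R-1 lost copies; keep one spare for surprises during maintenance.
    return max(r - 2, 0)

print(ec_maintenance_headroom(2, 1))    # 2+1 EC -> 0: no maintenance without non-redundant writes
print(ec_maintenance_headroom(8, 3))    # 8+3 EC -> 2 failure domains at a time
print(replica_maintenance_headroom(2))  # R=2    -> 0
print(replica_maintenance_headroom(3))  # R=3    -> 1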

If your SLAs require a higher minimum redundancy at all times, m (R) needs to be large enough to still allow maintenance, unless you schedule downtime. However, the latter would be odd, because one of the key features of ceph is its ability to provide infinite uptime while hardware gets renewed all the time.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Hans van den Bogert <hansbogert@xxxxxxxxx>
Sent: 16 November 2020 12:59:31
Cc: ceph-users
Subject:  Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Redundancy, all PGs degraded, undersized, not scrubbed in time

I think we're deviating from the original thread quite a bit, and I would
never argue that in a production environment with plenty of OSDs you should
go for R=2 or K+1, so my example cluster, which happens to be 2+1, is a
bit unlucky.

However, I'm interested in the following:

On 11/16/20 11:31 AM, Janne Johansson wrote:
 > So while one could always say "one more drive is better than your
 > amount", there are people losing data with repl=2 or K+1 because some
 > more normal operation was in flight and _then_ a single surprise
 > happened.  So you can have a weird reboot, causing those PGs to need
 > backfill later, and if one of the up-to-date hosts has any single
 > surprise during the recovery, the cluster will lack some of the current
 > data even though two disks were never down at the same time.

I'm not sure I follow: from a logical perspective they *are* down at the
same time, right? In your scenario one up-to-date replica was left, but
even that had a surprise. Okay, well, that's the risk you take with R=2,
but it's not intrinsically different from R=3.
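To make the distinction concrete, here is a minimal sketch (my own illustration, not from either mail) of the timeline described above; the OSD names and the bookkeeping dictionary are made up, but it shows how the set of copies holding the latest writes can drop to zero even though the two OSDs were never down at the same time:

latest_writes_on = {"osd.A": True, "osd.B": True}  # both copies hold the latest writes

# 1) osd.A has a weird reboot; writes continue on osd.B only.
latest_writes_on["osd.A"] = False                  # A's copy is now stale

# 2) osd.A comes back up; backfill starts but has not finished yet.

# 3) During recovery, osd.B has a single surprise (its disk dies).
latest_writes_on["osd.B"] = False                  # the only current copy is gone

current_copies = sum(latest_writes_on.values())
print("copies holding the latest writes:", current_copies)  # 0 -> current data is lost
# At no point were osd.A and osd.B down simultaneously.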
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



