Thanks for all the replies, folks. I think it's a testament to the versatility of Ceph that there are some differences of opinion and experience here.

With regards to the purpose of this cluster, it provides distributed storage for stateful container workloads. The data produced is somewhat immutable and can be regenerated over time, although regenerating it does slow down the teams that use it as part of their development pipeline. To the best of my understanding, the goals were to provide a data-loss safety net while still making efficient use of the block devices assigned to the cluster, which I imagine is where the EC direction came from. The cluster is 3 nodes, with the OSDs mainly housed in two of them. Additionally, there was an initiative to 'use what we have' (or, as I like to put it, 'cobble it together') with commodity hardware that was immediately to hand. The departure of my predecessor has left some unanswered questions, so I am not going to bother second-guessing beyond what I already know.

As I understand it, my steps are:

1. Move off the data and scrap the cluster as it stands currently (already under way).
2. Group the block devices into pools of the same geometry and type (and maybe do some tiering?).
3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily compromised by a loss at the bare-metal level (rough sketch of 2 and 3 below).
4. Add more hosts/OSDs if EC is the right solution (this may be outside the scope of this implementation, but I'll keep a-cobblin'!).

The additional ceph outputs follow:

ceph osd tree <https://termbin.com/vq63>
ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h>

I am fully prepared to do away with EC to keep things simple and efficient in terms of CPU occupancy.
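For steps 2 and 3, the rough shape I have in mind is below. This is only a sketch: the rule and pool names and the PG counts are placeholders, and it assumes the rebuilt OSDs report their device classes (hdd/ssd) correctly.

  # Replicated CRUSH rules split by device class, with 'host' as the
  # failure domain so a single node loss can't take out every copy of a PG:
  ceph osd crush rule create-replicated rep-hdd default host hdd
  ceph osd crush rule create-replicated rep-ssd default host ssd

  # Pools grouped by device type; 3 copies spread over the 3 nodes:
  ceph osd pool create container-hdd 64 64 replicated rep-hdd
  ceph osd pool set container-hdd size 3
  ceph osd pool set container-hdd min_size 2

With size 3 and min_size 2, a pool keeps serving I/O with one node down but stops once a PG is down to a single copy, which is the safety net I'm after. It obviously depends on step 3 being done first, since three host-separated copies need OSDs on all three nodes.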
On Mon, 16 Nov 2020 at 02:32, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

> On Mon, 16 Nov 2020 at 10:54, Hans van den Bogert <hansbogert@xxxxxxxxx>
> wrote:
>
> > > With this profile you can only lose one OSD at a time, which is really
> > > not that redundant.
> >
> > That's rather situation dependent. I don't have really large disks, so
> > the repair time isn't that long. Further, my SLO isn't so high that I
> > need 99.xxx% uptime; if 2 disks break in the same repair window, that
> > would be unfortunate, but I'd just grab a backup from a mirroring
> > cluster. Looking at it from another perspective: I came from a
> > single-host RAID5 scenario, and I'd argue this is better, since I can
> > survive a host failure.
> >
> > Also, this is a sliding problem, right? Someone with K+3 could argue
> > K+2 is not enough as well.
>
> There are a few situations, like when you are moving data or when a scrub
> has found a bad PG, where you are suddenly out of copies if something bad
> happens. I think RAID5 operators also found this out: when your cold
> spare disk kicks in, you find that old undetected error on one of the
> other disks and learn that rebuilds are risky or stress your raid too
> much.
>
> As with raids, the cheapest resource is often the actual disks rather
> than operator time, restore wait times and so on, which is why many on
> this list advocate for K+2-or-more, or repl=3: we have seen the errors
> one normally didn't expect. Yes, a double surprise of two disks failing
> in the same night after running for years is uncommon, but it is not as
> uncommon to resize pools, move PGs around or find a scrub error or two
> some day.
>
> So while one could always say "one more drive is better than your
> amount", there are people losing data with repl=2 or K+1 because some
> more normal operation was in flight and _then_ a single surprise
> happened. You can have a weird reboot, leaving those PGs needing
> backfill later, and if one of the up-to-date hosts has any single
> surprise during the recovery, the cluster will lack some of the current
> data even though two disks were never down at the same time.
>
> Drive manufacturers print Mean Time Between Failures; storage admins
> count Mean Time Between Surprises.
>
> --
> May the most significant bit of your life be positive.
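If EC does stay in the picture once there are more hosts, the advice above points at two coding chunks with 'host' as the failure domain. Purely as a sketch (the profile and pool names and the k/m split are placeholder values, not anything agreed in this thread):

  # k=4, m=2 needs at least 6 hosts with crush-failure-domain=host
  # (k=2, m=2 would need at least 4):
  ceph osd erasure-code-profile set ec-k4-m2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create media-ec 64 64 erasure ec-k4-m2

  # min_size of k+1, so the pool stops I/O once it is down to the bare k
  # shards rather than running with no margin during recovery:
  ceph osd pool set media-ec min_size 5

On the current 3 nodes the same profile would only map with crush-failure-domain=osd, and then a single host failure can take out multiple shards of a PG, which is exactly the trap I'm trying to climb out of.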