Re: Upgrading ceph with HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent

John Spray <jspray@xxxxxxxxxx> · Wed, 5 Sep 2018 11:41:19 +0100

On Wed, Sep 5, 2018 at 8:38 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
>
>
> The advised solution is to upgrade ceph only in HEALTH_OK state. And I
> also read somewhere that it is bad to have your cluster for a long time
> in a HEALTH_ERR state.
>
> But why is this bad?

Aside from the obvious (errors are bad things!), many people have
external monitoring systems that will alert them on the transitions
between OK/WARN/ERR.  If the system is stuck in ERR for a long time,
they are unlikely to notice new errors or warnings.  These systems can
accumulate faults without the operator noticing.
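
To make that concrete, here is a minimal sketch of the two styles of
alerting. It is not taken from any particular monitoring product: the
function names and the sample data are hypothetical, and the samples
are assumed to be (overall status, set of health check names) pairs
polled from the cluster in order.

```python
def transition_alerts(samples):
    """Alert only when the overall status changes between polls."""
    alerts = []
    previous = None
    for status, _checks in samples:
        if previous is not None and status != previous:
            alerts.append((previous, status))
        previous = status
    return alerts


def check_alerts(samples):
    """Alert on every health check that was absent in the previous poll."""
    alerts = []
    previous = set()
    for index, (_status, checks) in enumerate(samples):
        if index > 0:
            alerts.extend(sorted(set(checks) - previous))
        previous = set(checks)
    return alerts


# Hypothetical polls: the cluster is already in ERR from a scrub error,
# then an OSD goes down while the overall status stays at ERR.
samples = [
    ("HEALTH_ERR", {"OSD_SCRUB_ERRORS", "PG_DAMAGED"}),
    ("HEALTH_ERR", {"OSD_SCRUB_ERRORS", "PG_DAMAGED", "OSD_DOWN"}),
]

print(transition_alerts(samples))  # no status change, so nothing fires
print(check_alerts(samples))       # the new check is reported
```

With the status-only approach the second fault is silent, which is the
accumulation problem described above.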

> Why is this bad during upgrading?

It depends on what's gone wrong.  For example:
- If your cluster is degraded (fewer than the desired number of replicas
of data) then taking more services offline (even briefly) to do an
upgrade will create greater risk to the data by reducing the number of
copies available.
- If your system is in an error state because something has gone bad
on disk, then recovering it with the same software that wrote the data
is a more tested code path than running some newer code against a
system left in a strange state by an older version.
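
The first point is just arithmetic. A rough sketch, assuming a
replicated pool with size=3 and min_size=2 (example values, not
something from your cluster) and assuming the daemon being restarted
holds one of the remaining copies; the function name is made up for
illustration:

```python
def copies_during_upgrade(size, missing, restarting):
    """Copies of an object still reachable while daemons are restarted.

    size:       desired number of replicas for the pool
    missing:    replicas already lost (the degraded part)
    restarting: replicas taken offline briefly by the upgrade
    """
    return max(size - missing - restarting, 0)


SIZE = 3      # assumed pool size
MIN_SIZE = 2  # assumed pool min_size

healthy = copies_during_upgrade(SIZE, missing=0, restarting=1)
degraded = copies_during_upgrade(SIZE, missing=1, restarting=1)

print(healthy, healthy >= MIN_SIZE)    # 2 True
print(degraded, degraded >= MIN_SIZE)  # 1 False
```

In the healthy case a restart still leaves two copies. In the degraded
case the same restart leaves a single copy, which is below the assumed
min_size, so one more failure at that moment would be much more
serious.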

There will always be exceptions to this (e.g. where the upgrade is the
fix for whatever caused the error), but the general-purpose advice is
to get a system nice and clean before starting the upgrade.
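
If you script your upgrades, that advice could be expressed as a
simple gate. This is only a sketch: ok_to_upgrade and its
fixed_by_upgrade argument are hypothetical names, and the status and
check names are assumed to have been read from the cluster's health
report beforehand.

```python
def ok_to_upgrade(status, checks, fixed_by_upgrade=frozenset()):
    """Return True if it looks reasonable to start the upgrade.

    status:           overall health, e.g. "HEALTH_OK"
    checks:           names of the health checks currently raised
    fixed_by_upgrade: checks the operator has decided the upgrade itself
                      will fix (the exception case, chosen by a human)
    """
    if status == "HEALTH_OK":
        return True
    # Not clean: proceed only if every outstanding check is one the
    # operator has explicitly listed as being fixed by the upgrade.
    return bool(checks) and set(checks) <= set(fixed_by_upgrade)


print(ok_to_upgrade("HEALTH_OK", set()))
print(ok_to_upgrade("HEALTH_ERR", {"OSD_SCRUB_ERRORS", "PG_DAMAGED"}))
```

The override is deliberately an explicit list rather than a "force"
flag, so that a fault that appears later still blocks the upgrade.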

John

> Can I quantify how bad it is? (like with large log/journal file?)
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com