On Thu, Aug 23, 2018 at 12:26 AM mj <lists@xxxxxxxxxxxxx> wrote:
Hi,
Thanks John and Gregory for your answers.
Gregory's answer worries us. We thought that with a 3/2 pool and one PG
copy corrupted, the assumption would be: the two matching copies are
correct, and the third one needs to be adjusted.
Can we determine from this output whether we created corruption in our cluster?
No, we can't tell. It's not very likely, though, and given that this is an RBD pool, I'd bet the most likely inconsistency was a lost journal write which is long since unneeded by the VM anyway. If it's a concern, you could try tracking down the RBD image (it's got the internal ID 2c191e238e1f29; I'm not sure what the command is to turn that into the front-end name) and running an fsck and any available application-level data scrubs.
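For what it's worth, one way to map that internal ID back to an image name is to scan each image's block_name_prefix; a sketch, where the pool name "rbd" is an assumption and the ID is the one from the scrub log below:

```shell
#!/bin/sh
# Sketch: find which RBD image owns objects named rbd_data.<prefix>.*
# Assumptions: the pool is called "rbd"; the prefix is taken from this
# thread's scrub log (rbd_data.2c191e238e1f29...).
prefix=2c191e238e1f29
for img in $(rbd -p rbd ls); do
  # "rbd info" prints a line like: block_name_prefix: rbd_data.<id>
  if rbd -p rbd info "$img" | grep -q "block_name_prefix: rbd_data.${prefix}"; then
    echo "object prefix rbd_data.${prefix} belongs to image: ${img}"
  fi
done
```

With the image name in hand, you can run fsck (and any application-level checks) from inside the VM that uses it.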
In the future, I believe the most reliable way on Jewel is simply to go look at the objects and do the vote-counting yourself. Later versions of Ceph include more checksumming output etc. to make this easier and to more reliably identify the broken copy in the first place.
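On Jewel with filestore, the manual vote count might look something like the following. This is a sketch, not a tested procedure: the acting set and object name are taken from this thread, the filestore paths are the usual defaults, and on-disk object filenames are mangled, hence the find:

```shell
#!/bin/sh
# 1. Confirm which OSDs hold the object (acting set [15,23,6] in this thread):
ceph osd map rbd rbd_data.2c191e238e1f29.00000000000c7c9d

# 2. On each OSD's host, locate the on-disk copy and checksum it
#    (repeat with ceph-23 and ceph-6 on their respective hosts):
find /var/lib/ceph/osd/ceph-15/current/2.1a9_head \
     -name '*2c191e238e1f29.00000000000c7c9d*' -exec md5sum {} \;

# 3. Collect the three checksum lines into votes.txt, then count the votes;
#    with 3 replicas, the checksum seen twice wins and the odd one out is bad:
awk '{print $1}' votes.txt | sort | uniq -c | sort -rn
```

Reading filestore directories is generally safe, but don't modify anything under them by hand; hand the repair back to Ceph once you know which copy is bad.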
-Greg
> root@pm1:~# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
> 1 scrub errors
> root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
> /var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
> /var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects
> /var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
> /var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
> /var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
> /var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed
Also: is jewel (which we're running) considered "the old past", with
the old non-checksum behaviour?
In case this occurs again... what would be the steps to determine WHICH
copy is the corrupt one, and how should we proceed if it happens to be
the primary copy of an object?
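For the record, even on Jewel the scrub machinery records which shard failed; a sketch of two ways to ask, using the PG id from this thread (output formats grew richer in later releases):

```shell
#!/bin/sh
# Ask the cluster for the recorded scrub inconsistencies (available in Jewel):
rados list-inconsistent-obj 2.1a9 --format=json-pretty

# Or pull the offending shard straight out of the cluster log; the scrub
# lines name it explicitly ("shard 23: ... candidate had a read error"):
zgrep 'candidate had a read error' /var/log/ceph/ceph.log* \
  | sed -n 's/.* shard \([0-9]*\):.*/\1/p' | sort -u
```

In this particular case shard 23 had a read error, so the primary's data was never in doubt; the trickier case is a silent mismatch with no read error, where the manual vote-counting approach applies.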
Upgrading to luminous would prevent this from happening again, I guess.
We're a bit scared to upgrade, because there seem to be so many reported
issues with luminous and with upgrading to it.
Having said all this: we are surprised to see this on our cluster, as
it should be, and has been, running stable and reliably for over two
years. Perhaps it was just a one-time glitch.
Thanks for your replies!
MJ
On 08/23/2018 01:06 AM, Gregory Farnum wrote:
> On Wed, Aug 22, 2018 at 2:46 AM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Wed, Aug 22, 2018 at 7:57 AM mj <lists@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > This morning I woke up to find my ceph jewel 10.2.10 cluster in
> > HEALTH_ERR state. That helps you get out of bed. :-)
> >
> > Anyway, much to my surprise, all VMs running on the cluster were still
> > working like nothing was going on. :-)
> >
> > Checking a bit more revealed:
> >
> > > root@pm1:~# ceph -s
> > > cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
> > > health HEALTH_ERR
> > > 1 pgs inconsistent
> > > 1 scrub errors
> > >      monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
> > >             election epoch 296, quorum 0,1,2 0,1,2
> > > osdmap e12662: 24 osds: 24 up, 24 in
> > > flags sortbitwise,require_jewel_osds
> > >       pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
> > > 44027 GB used, 45353 GB / 89380 GB avail
> > > 1087 active+clean
> > > 1 active+clean+inconsistent
> > > client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr
> > > root@pm1:~# ceph health detail
> > > HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> > > pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
> > > 1 scrub errors
> > > root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
> > > /var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
> > > /var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
> > > /var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
> > > /var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
> > > /var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects
> >
> > ok, according to the docs I should do "ceph pg repair 2.1a9". Did that,
> > and some minutes later the cluster came back to "HEALTH_OK".
> >
> > Checking the logs:
> > > /var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
> > > /var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
> > > /var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
> > > /var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed
> >
> > So, we are fine again, it seems.
> >
> > But now my question: can anyone tell what happened? Is one of my
> > disks dying?
> > In the proxmox gui, all osd disks are SMART status "OK".
> >
> > Besides that, as the cluster was still running and the fix was
> > relatively simple, would a HEALTH_WARN not have been more
> > appropriate?
>
> An inconsistent PG generally implies data corruption, which is usually
> pretty scary. Your cluster may have been running okay for the moment,
> but things might not be so good if your workload happens to touch that
> one inconsistent object.
>
> This is a subjective thing, and sometimes users aren't so worried
> about inconsistency:
> - known-unreliable hardware, where periodic corruption is expected
> - pools that are just for dev/test, where corruption is not an
> urgent issue
>
> In those cases, they might need to do some external filtering of
> health checks, possibly down-grading the PG_DAMAGED check.
>
> > And, since this is a size 3, min 2 pool... shouldn't this have been
> > taken care of automatically..? ('self-healing' and all that..?)
>
> The good news is that there's an osd_scrub_auto_repair option (default
> is false).
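If one wanted to turn that on, it's an OSD-side option in ceph.conf; a sketch, with option names and defaults to be checked against your release's documentation:

```ini
[osd]
# Let deep-scrub kick off an automatic repair when it finds errors
# (default: false, per John's note above).
osd scrub auto repair = true
# Assumed companion option: skip auto-repair if a PG has more errors
# than this (default 5 in releases that have it).
osd scrub auto repair num errors = 5
```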
>
> I imagine there was probably some historical debate about whether that
> should be on by default; core RADOS folks probably know more.
>
>
> In the past, "recovery" merely forced all the replicas into alignment
> with the primary. If the primary was the bad copy...well, too bad!
>
> Things are much better now that we have checksums in various places and
> take more care about it. But it's still possible to configure and use
> Ceph so that we don't know what the right answer is, and these kinds of
> issues really aren't supposed to turn up, so we don't yet feel
> comfortable auto-repairing.
> -Greg
>
>
> John
>
>
> > So, I'm having my morning coffee finally, wondering what
> happened... :-)
> >
> > Best regards to all, have a nice day!
> >
> > MJ
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com