HEALTH_ERR vs HEALTH_WARN

mj <lists@xxxxxxxxxxxxx> · Wed, 22 Aug 2018 08:56:38 +0200

Hi,

This morning I woke up to find my Ceph Jewel 10.2.10 cluster in
HEALTH_ERR state. That helps you get out of bed. :-)

Anyway, much to my surprise, all VMs running on the cluster were still
working as if nothing was going on. :-)

Checking a bit more revealed:

root@pm1:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            1 pgs inconsistent
            1 scrub errors
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 296, quorum 0,1,2 0,1,2
     osdmap e12662: 24 osds: 24 up, 24 in
            flags sortbitwise,require_jewel_osds
      pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
            44027 GB used, 45353 GB / 89380 GB avail
                1087 active+clean
                   1 active+clean+inconsistent
  client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr
root@pm1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
1 scrub errors
root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects
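
The interesting part of that output is the "candidate had a read error" line. As a rough sketch (the helper name read_error_shards is made up for illustration, and it assumes the log line layout shown above), the PG, shard and object could be pulled out of such lines like this:

```shell
#!/bin/sh
# Hypothetical helper: reads ceph.log lines on stdin and prints
# "<pg> <shard> <object>" for every line reporting a read error.
# Assumes the "<pg> shard <id>: soid <object> ..." layout seen above.
read_error_shards() {
    awk '/candidate had a read error/ {
        pg = shard = obj = ""
        for (i = 2; i < NF; i++) {
            if ($i == "shard") {
                pg = $(i - 1)            # PG id precedes the word "shard"
                shard = $(i + 1)         # shard id follows, with a trailing ":"
                sub(/:$/, "", shard)
            }
            if ($i == "soid") {
                obj = $(i + 1)           # object name follows "soid"
            }
        }
        print pg, shard, obj
    }'
}

# Possible usage on a cluster node:
#   zgrep 2.1a9 /var/log/ceph/ceph.log* | read_error_shards
```

For the lines above this would point at shard 23, which matches one of the OSDs in the acting set [15,23,6].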

OK, according to the docs I should run "ceph pg repair 2.1a9". I did that,
and a few minutes later the cluster came back to "HEALTH_OK".
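
For anyone hitting the same thing, a minimal sketch of looking at the inconsistency before repairing it. The wrapper functions and the DRY_RUN switch are assumptions added for illustration. Only the two underlying commands are real. Note that "rados list-inconsistent-obj" reports what the most recent scrub found, so it may come back empty once a repair has run.

```shell
#!/bin/sh
# Hypothetical wrappers. By default (DRY_RUN=1) the commands are only
# printed; set DRY_RUN=0 on a cluster node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show which object and which shard the last scrub flagged in a PG.
inspect_pg() {
    run rados list-inconsistent-obj "$1" --format=json-pretty
}

# Ask the primary OSD to repair the PG.
repair_pg() {
    run ceph pg repair "$1"
}

inspect_pg 2.1a9
repair_pg 2.1a9
```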

Checking the logs:
/var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
/var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.00000000000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed

So, we are fine again, it seems.

But now my question: can anyone explain what happened? Is one of my disks
dying? In the Proxmox GUI, all OSD disks show SMART status "OK".
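
An overall SMART "OK" is, as far as I understand, only the drive's own pass/fail verdict, so the individual attributes may say more. A small sketch under that assumption: the function name suspect_smart_attrs is hypothetical, /dev/sdX is a placeholder, and the attribute names are the common ATA ones, which vary by vendor and do not apply to SAS disks.

```shell
#!/bin/sh
# Hypothetical filter for "smartctl -A" output: prints the name and raw
# value of sector-related attributes whose raw value is above zero.
# Assumes the usual ATA table layout with RAW_VALUE in column 10.
suspect_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ && $10 + 0 > 0 {
        print $2, $10
    }'
}

# Possible usage on the host carrying the suspect OSD:
#   smartctl -A /dev/sdX | suspect_smart_attrs
```

No output would mean none of these counters is above zero, which still does not rule out a failing disk.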

Besides that, as the cluster was still running and the fix was 
relatively simple, would a HEALTH_WARN not have been more appropriate?

And, since this is a size 3, min 2 pool... shouldn't this have been 
taken care of automatically..? ('self-healing' and all that..?)

So, I'm finally having my morning coffee, wondering what happened... :-)

Best regards to all, have a nice day!

MJ
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com