Dear All,

A bad disk controller appears to have damaged our cluster...

# ceph health
HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent

Probing to find the bad pgs...

# ceph health detail
HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent
OSD_SCRUB_ERRORS 10 scrub errors
PG_DAMAGED Possible data damage: 10 pgs inconsistent
    pg 4.1de is active+clean+inconsistent, acting [333,367,315,36,241,280,200,439,182,121]
(SNIP... the next 9 bad pgs are listed similarly to the one above)

Now looking for further detail...

[root@ceph1 ~]# rados list-inconsistent-obj 4.1de
No scrub information available for pg 4.1de
error 2: (2) No such file or directory

Presumably we need to initiate a manual scrub...?

# ceph pg scrub 4.1de
instructing pg 4.1des0 on osd.333 to scrub

The current date/time is...

# date +"%F %T"
2018-06-21 09:57:27

Now look at the osd log...

# tail -2 ceph-osd.333.log
2018-06-21 07:27:56.253 7f39a4423700  0 log_channel(cluster) log [DBG] : 5.d27 deep-scrub starts
2018-06-21 07:27:56.331 7f39a4423700  0 log_channel(cluster) log [DBG] : 5.d27 deep-scrub ok

Note the date stamps above: the most recent scrub activity in the log (on a
different pg, 5.d27) predates our scrub command by two and a half hours, so
the command appears to have been ignored.

Any ideas on why this is happening, and what we can do to fix the error?

Some background:
Cluster recently upgraded from Luminous (12.2.5) to Mimic (13.2.0)
Pool uses EC 8+2; 10 nodes with 450 x 8TB Bluestore OSDs in total

Any ideas gratefully received...

Jake
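P.S. Our current working theory (unconfirmed) is that the scrub request was
queued rather than dropped: osd_max_scrubs defaults to 1, and a scrub of this
EC 8+2 pg needs scrub reservations on all ten acting OSDs, so any other scrub
in flight on any of them will delay ours. A rough checklist we intend to work
through, using only stock ceph CLI commands (osd.333 and pg 4.1de taken from
the output above):

First, confirm scrubbing isn't disabled cluster-wide:

# ceph osd dump | grep flags

If noscrub/nodeep-scrub are not set, check the per-OSD scrub limit on the
primary (run on the node hosting osd.333, via its admin socket):

# ceph daemon osd.333 config get osd_max_scrubs

Then watch the log for our specific pg instead of tailing the last two lines:

# grep 4.1de /var/log/ceph/ceph-osd.333.log

And since list-inconsistent-obj complained about missing scrub information,
a deep scrub should regenerate it:

# ceph pg deep-scrub 4.1de
# rados list-inconsistent-obj 4.1de --format=json-pretty

Once we can see which shards are inconsistent, "ceph pg repair 4.1de" looks
like the next step, but we'd want to understand the damage before running it.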