On 06/21/2018 11:11 AM, Jake Grimmett wrote:
> Dear All,
>
> A bad disk controller appears to have damaged our cluster...
>
> # ceph health
> HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent
>
> probing to find bad pg...
>
> # ceph health detail
> HEALTH_ERR 10 scrub errors; Possible data damage: 10 pgs inconsistent
> OSD_SCRUB_ERRORS 10 scrub errors
> PG_DAMAGED Possible data damage: 10 pgs inconsistent
>     pg 4.1de is active+clean+inconsistent, acting
> [333,367,315,36,241,280,200,439,182,121]
> (SNIP...next 9 bad pg are listed similar to above)
>
> now looking for further detail...
>
> [root@ceph1 ~]# rados list-inconsistent-obj 4.1de
> No scrub information available for pg 4.1de
> error 2: (2) No such file or directory
>
> presumably we need to initiate a manual scrub...?
>
> # ceph pg scrub 4.1de
> instructing pg 4.1des0 on osd.333 to scrub
>
> Current date/time is...
>
> # date +"%F %T"
> 2018-06-21 09:57:27
>
> now look at the osd log...
>
> # tail -2 ceph-osd.333.log
> 2018-06-21 07:27:56.253 7f39a4423700  0 log_channel(cluster) log [DBG] :
> 5.d27 deep-scrub starts
> 2018-06-21 07:27:56.331 7f39a4423700  0 log_channel(cluster) log [DBG] :
> 5.d27 deep-scrub ok
>
> Note the above date stamps, the scrub command appears to be ignored
>
> Any ideas on why this is happening, and what we can do to fix the error?

Are any of the OSDs involved with that PG currently doing recovery? If
so, they will ignore a scrub until the recovery has finished.

Or set osd_scrub_during_recovery=true

Wido

> Some background:
> Cluster upgraded from Luminous (12.2.5) to Mimic (13.2.0)
> Pool uses EC 8+2, 10 nodes, 450 x 8TB Bluestore OSD
>
> Any ideas gratefully received..
>
> Jake
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
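
For reference, a rough sketch of the commands for the checks Wido describes
(assuming Mimic 13.2.x; the pg id 4.1de is taken from the output above, and
the exact config syntax may differ on other releases). Note that
rados list-inconsistent-obj needs results from a deep scrub, so a deep-scrub
is requested at the end:

# check whether the cluster is still recovering / backfilling
ceph -s

# look at the recovery state of the affected pg
ceph pg 4.1de query

# if recovery is what blocks the scrub, allow scrubbing during recovery
# (persisted in the mon config store on Mimic)
ceph config set osd osd_scrub_during_recovery true

# or inject it into the running OSDs without persisting it
ceph tell 'osd.*' injectargs '--osd_scrub_during_recovery=true'

# then re-issue a deep scrub so list-inconsistent-obj has data to report
ceph pg deep-scrub 4.1de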