Ceph cluster with 60 OSDs, running Giant 0.87.2. One of the OSDs failed due to a hardware error, but after the normal recovery the cluster seems stuck with one active+undersized+degraded+inconsistent pg.
I haven't been able to get a repair to run using "ceph pg repair 12.28a"; I can see the command logged in the mon logs, but the repair never actually shows up in any of the osd logs. The commands I'm using are below.
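For reference, this is roughly what I've been running to kick off the repair and to watch for it on the primary (osd.36 is the primary for this pg; ids and paths below are just from my setup):

    # ask for a repair, and also tried a fresh deep-scrub
    ceph pg repair 12.28a
    ceph pg deep-scrub 12.28a
    # watch the primary OSD's log for the scrub/repair to start
    tail -f /var/log/ceph/ceph-osd.36.log
    # detailed peering/scrub state for this pg
    ceph pg 12.28a query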
I tried following Sébastien's instructions for manually locating the inconsistent object (http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/), but the md5sums of the two object copies match, so I'm not sure how to proceed. The manual steps I attempted are sketched below.
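The manual fix in that post boils down to something like the following (sketched here against osd.36 purely as an example; the idea is to remove the bad copy so that repair rewrites it from the good replica, but since both checksums match I can't tell which copy, if either, is actually bad):

    # on the node holding the suspect copy (osd.36 here, only as an example)
    service ceph stop osd.36
    ceph-osd -i 36 --flush-journal
    # move the suspect object out of the pg directory for safekeeping
    mv /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c /root/
    service ceph start osd.36
    ceph pg repair 12.28a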
Any ideas on how to return to a healthy cluster?
[root@soi-ceph2 ceph]# ceph status
cluster 6cc00165-4956-4947-8605-53ba51acd42b
health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023 pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck undersized; 1023 pgs undersized; recovery 132091/23742762 objects degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub errors
monmap e5: 3 mons at {soi-ceph1=10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0}, election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
osdmap e41120: 60 osds: 59 up, 59 in
pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
91295 GB used, 73500 GB / 160 TB avail
132091/23742762 objects degraded (0.556%); 7745/23742762 objects misplaced (0.033%)
60341 active+clean
76 active+remapped
1022 active+undersized+degraded
1 active+undersized+degraded+inconsistent
client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
[root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
pg 12.28a is stuck unclean for 126274.215835, current state active+undersized+degraded+inconsistent, last acting [36,52]
pg 12.28a is stuck undersized for 3499.099747, current state active+undersized+degraded+inconsistent, last acting [36,52]
pg 12.28a is stuck degraded for 3499.107051, current state active+undersized+degraded+inconsistent, last acting [36,52]
pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
[root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700 -1 log_channel(default) log [ERR] : 12.28a shard 20: soid c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12 candidate had a read error, digest 2029411064 != known digest 2692480864
ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700 -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1 inconsistent objects
ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700 -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
[root@soi-ceph2 ceph]# md5sum /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
\fb57b1f17421377bf2c35809f395e9b9 /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
[root@soi-ceph3 ceph]# md5sum /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
\fb57b1f17421377bf2c35809f395e9b9 /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
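Both copies currently in the acting set ([36,52]) have identical checksums, which is what's confusing me; the "shard 20" in the deep-scrub error is, I assume, the OSD that failed and is no longer acting. This is how I'm double-checking the current mapping for the pg:

    # confirm which OSDs the pg currently maps to / acts on
    ceph pg map 12.28a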