Re: PG Stuck active+undersized+degraded+inconsistent

On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:

> Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due to a
> hardware error, however after normal recovery it seems stuck with
> one active+undersized+degraded+inconsistent pg.
>
Any reason (other than inertia, which I understand very well) you're
running a non LTS version that last saw bug fixes a year ago?
You may very well be facing a bug that has long been fixed even in Firefly,
let alone Hammer.
I know we discussed Hammer several times, and I don't remember the exact reason we held off.  Other than that, inertia is probably the best answer I have.

If so, hopefully one of the devs remembering it can pipe up.

> I haven't been able to get repair to happen using "ceph pg repair
> 12.28a"; I can see the activity logged in the mon logs, however the
> repair doesn't actually seem to happen in any of the actual osd logs.
>
> I tried following Sebastien's instructions for manually locating the
> inconsistent object (
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/),
> however the md5sum from the objects both match, so I'm not quite sure how
> to proceed.
>
Rolling a dice? ^o^
Do they have similar (identical really) timestamps as well?
Yes, timestamps are identical. 
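(For anyone else chasing the same thing: a quick, non-authoritative way to compare the two replicas is to stat the object file on each OSD host; $OBJ below stands for the full ceph-36 / ceph-52 object path shown in the md5sum output further down.)

  # on soi-ceph2 (osd.36) and again on soi-ceph3 (osd.52), against the same object
  stat "$OBJ"    # size, mtime and ctime should match between the two replicas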

> Any ideas on how to return to a healthy cluster?
>
> [root@soi-ceph2 ceph]# ceph status
>     cluster 6cc00165-4956-4947-8605-53ba51acd42b
>      health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023 pgs
> stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck undersized; 1023
> pgs undersized; recovery 132091/23742762 objects degraded (0.556%);
> 7745/23742762 objects misplaced (0.033%); 1 scrub errors
>      monmap e5: 3 mons at {soi-ceph1=
> 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
>      osdmap e41120: 60 osds: 59 up, 59 in
>       pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728 kobjects
>             91295 GB used, 73500 GB / 160 TB avail
>             132091/23742762 objects degraded (0.556%); 7745/23742762
> objects misplaced (0.033%)
>                60341 active+clean
>                   76 active+remapped
>                 1022 active+undersized+degraded
>                    1 active+undersized+degraded+inconsistent
>   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
>
What's confusing to me in this picture are the stuck and unclean PGs as
well as degraded objects, it seems that recovery has stopped?
Yeah ... recovery has essentially halted.  I'm sure it's no accident that there are exactly 1023 (1024 - 1) unhealthy pgs.

Something else that suggests a bug, or at least a stuck OSD.
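If you want to dig into where recovery is stuck, something along these lines should narrow it down (the pool name is a placeholder, and this is only a suggestion, not a known fix):

  ceph pg 12.28a query                 # the recovery_state section shows what the PG is waiting on
  ceph pg dump_stuck unclean | head    # check whether the 1023 stuck PGs all share one pool / one down OSD
  ceph osd pool get <poolname> pg_num  # a pool with pg_num 1024 would fit the 1023 count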

> [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> pg 12.28a is stuck unclean for 126274.215835, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is stuck undersized for 3499.099747, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is stuck degraded for 3499.107051, current state
> active+undersized+degraded+inconsistent, last acting [36,52]
> pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
>
> [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a shard 20: soid
> c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> candidate had a read error, digest 2029411064 != known digest 2692480864
  ^^^^^^^^^^^^^^^^^^^^^^^^^^
That's the culprit, google for it. Of course the most promising looking
answer is behind the RH pay wall.
This part is the most confusing for me.  To me, a digest mismatch should indicate some kind of bitrot on the disk (I'd love for ZFS to be better supported here).  What I don't understand is that the actual object has identical md5sums, timestamps, etc. on both OSDs.  I don't know whether this means there was just a transient read error that Ceph can't get past, or whether I'm mistakenly looking at the wrong object.  Maybe something stored in an xattr somewhere?
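If it is metadata rather than the file contents, it might be worth comparing the extended attributes as well; FileStore keeps part of each object's metadata in user.ceph.* / user.cephos.* xattrs (the rest lives in the omap leveldb), so a rough comparison on both OSD hosts would be something like this (again with $OBJ as the full object path from the md5sum commands below):

  # run as root on soi-ceph2 and soi-ceph3 against the same object file
  getfattr -d -m '.*' "$OBJ"    # dump all xattrs so the two replicas can be diffed by hand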
 
Looks like that disk has an issue, guess you're not seeing this on osd.52,
right?
Correct. 
Check osd.36's SMART status.
SMART is normal, no errors, all counters seem fine. 
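(For reference, the usual checks here would be something like the following, with /dev/sdX standing in for whatever disk backs osd.36:)

  smartctl -a /dev/sdX          # overall health plus reallocated / pending sector counters
  smartctl -l error /dev/sdX    # the drive's own error log
  dmesg | grep -i error         # kernel-level I/O errors the drive itself may not have recorded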

My guess is that you may have to set min_size to 1 and recover osd.36 as
well, but don't take my word for it.
Thanks for the suggestion.  I'm holding out for now in case someone else reads this and has an "aha" moment.  At this point, I'm not sure whether it would be more dangerous to try to blow away the object on osd.36 and hope for recovery (with min_size 1), or to try a software upgrade on an unhealthy cluster (yuck).
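In case it helps someone else in the same spot, the "blow away the object" route would look roughly like Sebastien's procedure applied here (entirely untested against this cluster, and the service commands depend on how the OSDs are managed):

  # on soi-ceph2, take the suspect OSD down cleanly
  service ceph stop osd.36            # or /etc/init.d/ceph stop osd.36 on plain sysvinit
  ceph-osd -i 36 --flush-journal      # flush the journal before touching files on disk
  mkdir -p /root/pg12.28a-backup
  mv "$OBJ" /root/pg12.28a-backup/    # move (do not delete) the suspect replica out of the PG directory
  service ceph start osd.36
  ceph pg repair 12.28a               # let the repair rewrite the object from the remaining good copy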

Christian

> ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
> inconsistent objects
> ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700 -1
> log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
>
> [root@soi-ceph2 ceph]# md5sum
> /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> \fb57b1f17421377bf2c35809f395e9b9
>  /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
>
> [root@soi-ceph3 ceph]# md5sum
> /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> \fb57b1f17421377bf2c35809f395e9b9
>  /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
