Re: PG Stuck active+undersized+degraded+inconsistent

On Wed, 30 Mar 2016 15:50:07 +0000 Calvin Morrow wrote:

> On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:
> >
> > > Ceph cluster with 60 OSDs, Giant 0.87.2.  One of the OSDs failed due
> > > to a hardware error, however after normal recovery it seems stuck
> > > with one active+undersized+degraded+inconsistent pg.
> > >
> > Any reason (other than inertia, which I understand very well) you're
> > running a non LTS version that last saw bug fixes a year ago?
> > You may very well be facing a bug that has long been fixed even in
> > Firefly, let alone Hammer.
> >
> I know we discussed Hammer several times, and I don't remember the exact
> reason we held off.  Other than that, inertia is probably the best
> answer I have.
> 
Fair enough. 

I just seem to remember similar scenarios where recovery got stuck/hung,
and would therefore assume this has been fixed in newer versions.
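
Before anything else it might be worth asking the PG itself what it
thinks is blocking recovery; something along these lines (standard
commands, PG id taken from your output below):

ceph health detail | grep 12.28a
ceph pg 12.28a query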

If you google for "ceph recovery stuck", you find another potential
solution behind the RH paywall, as well as this:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html

That would have been my next suggestion anyway; Ceph OSDs seem to take
well to the 'IT Crowd' mantra of "Have you tried turning it off and on
again?". ^o^

> >
> > If so, hopefully one of the devs remembering it can pipe up.
> >
> > > I haven't been able to get repair to happen using "ceph pg repair
> > > 12.28a"; I can see the activity logged in the mon logs, however the
> > > repair doesn't actually seem to happen in any of the actual osd logs.
> > >
> > > I tried following Sebastien's instructions for manually locating the
> > > inconsistent object (
> > > http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> > ),
> > > however the md5sum from the objects both match, so I'm not quite
> > > sure how to proceed.
> > >
> > Rolling a die? ^o^
> > Do they have similar (identical really) timestamps as well?
> >
> Yes, timestamps are identical.
> 
Unsurprisingly.

> >
> > > Any ideas on how to return to a healthy cluster?
> > >
> > > [root@soi-ceph2 ceph]# ceph status
> > >     cluster 6cc00165-4956-4947-8605-53ba51acd42b
> > >      health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023
> > > pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck
> > > undersized; 1023 pgs undersized; recovery 132091/23742762 objects
> > > degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub
> > > errors monmap e5: 3 mons at {soi-ceph1=
> > > 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> > > election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
> > >      osdmap e41120: 60 osds: 59 up, 59 in
> > >       pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728
> > > kobjects 91295 GB used, 73500 GB / 160 TB avail
> > >             132091/23742762 objects degraded (0.556%); 7745/23742762
> > > objects misplaced (0.033%)
> > >                60341 active+clean
> > >                   76 active+remapped
> > >                 1022 active+undersized+degraded
> > >                    1 active+undersized+degraded+inconsistent
> > >   client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
> > >
> > What's confusing to me in this picture are the stuck and unclean PGs as
> > well as the degraded objects; it seems that recovery has stopped?
> >
> Yeah ... recovery essentially halted.  I'm sure it's no accident that
> there are exactly 1023 (1024-1) unhealthy pgs.
> 
> >
> > Something else that suggests a bug, or at least a stuck OSD.
> >
> > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> > > pg 12.28a is stuck unclean for 126274.215835, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck undersized for 3499.099747, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck degraded for 3499.107051, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
> > >
> > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
> > >
> > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> > > candidate had a read error, digest 2029411064 != known digest
> > > 2692480864
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > That's the culprit, google for it. Of course the most promising looking
> > answer is behind the RH pay wall.
> >
> This part is the most confusing for me.  To me, this should indicate that
> there was some kind of bitrot on the disk (I'd love for ZFS to be better
> supported here).  What I don't understand is that the actual object has
> identical md5sums, timestamps, etc.  I don't know if this means there was
> just a transient error that Ceph can't get over, or whether I'm
> mistakenly looking at the wrong object.  Maybe something stored in an
> xattr somewhere?
> 
I can think of more scenarios, not knowing in detail how either that
checksum or md5sum works.
For example, one going through the page cache while the other doesn't,
or the checksum being corrupted, written out of order, etc.

And transient errors should hopefully respond well to an OSD restart.
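
If you want to compare more than the file contents, the object metadata
lives in xattrs on the file, and a fresh deep-scrub after a restart
should tell you whether the read error was transient. Roughly (object
path abbreviated, untested here):

getfattr -d -m '.*' /var/lib/ceph/osd/ceph-36/current/12.28a_head/.../default.64340.11...__head_C5CF428A__c
ceph pg deep-scrub 12.28a
ceph pg repair 12.28a     # only once the deep-scrub output looks sane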

> >
> > Looks like that disk has an issue, guess you're not seeing this on
> > osd.52, right?
> >
> Correct.
> 
> > Check osd.36's SMART status.
> >
> SMART is normal, no errors, all counters seem fine.
> 
If there were an actual issue with the HDD, I'd expect to see at least
some Pending or Offline sectors.
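
For reference, the counters I'd check first, assuming the usual
smartmontools and that osd.36 sits on /dev/sdX (adjust the device):

smartctl -a /dev/sdX | grep -iE 'pending|offline_uncorrectable|reallocated'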

> >
> > My guess is that you may have to set min_size to 1 and recover osd.35
> > as well, but don't take my word for it.
> >
> Thanks for the suggestion.  I'm holding out for the moment in case
> someone else reads this and has an "aha" moment.  At the moment, I'm not
> sure if it would be more dangerous to try and blow away the object on
> osd.36 and hope for recovery (with min_size 1) or try a software upgrade
> on an unhealthy cluster (yuck).
> 
Well, see above.

And yeah, neither of those two alternatives is particularly alluring.
OTOH, you're looking at just one object versus a whole PG or OSD.
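
If you do go the "blow away the copy on osd.36" route, it is essentially
the procedure from Sebastien's post you already linked; very roughly,
and please double-check every path against your own setup before running
any of it (the object path below is abbreviated):

ceph osd set noout
service ceph stop osd.36        # or 'stop ceph-osd id=36' on Upstart
mv /var/lib/ceph/osd/ceph-36/current/12.28a_head/.../default.64340.11...__head_C5CF428A__c /root/
service ceph start osd.36
ceph osd unset noout
ceph pg repair 12.28a           # once the PG has re-peered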

Christian

> >
> > Christian
> >
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
> > > inconsistent objects
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
> > >
> > > [root@soi-ceph2 ceph]# md5sum
> > >
> > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > > \fb57b1f17421377bf2c35809f395e9b9
> > >
> > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > >
> > > [root@soi-ceph3 ceph]# md5sum
> > >
> > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > > \fb57b1f17421377bf2c35809f395e9b9
> > >
> > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


