On Fri, Apr 1, 2016 at 4:42 PM Bob R <bobr@xxxxxxxxxxxxxx> wrote:
Calvin,

What does your crushmap look like?

ceph osd tree
[root@soi-ceph1 ~]# ceph osd tree
# id weight type name up/down reweight
-1 163.8 root default
-2 54.6 host soi-ceph1
0 2.73 osd.0 up 1
5 2.73 osd.5 up 1
10 2.73 osd.10 up 1
15 2.73 osd.15 up 1
20 2.73 osd.20 down 0
25 2.73 osd.25 up 1
30 2.73 osd.30 up 1
35 2.73 osd.35 up 1
40 2.73 osd.40 up 1
45 2.73 osd.45 up 1
50 2.73 osd.50 up 1
55 2.73 osd.55 up 1
60 2.73 osd.60 up 1
65 2.73 osd.65 up 1
70 2.73 osd.70 up 1
75 2.73 osd.75 up 1
80 2.73 osd.80 up 1
85 2.73 osd.85 up 1
90 2.73 osd.90 up 1
95 2.73 osd.95 up 1
-3 54.6 host soi-ceph2
1 2.73 osd.1 up 1
6 2.73 osd.6 up 1
11 2.73 osd.11 up 1
16 2.73 osd.16 up 1
21 2.73 osd.21 up 1
26 2.73 osd.26 up 1
31 2.73 osd.31 up 1
36 2.73 osd.36 up 1
41 2.73 osd.41 up 1
46 2.73 osd.46 up 1
51 2.73 osd.51 up 1
56 2.73 osd.56 up 1
61 2.73 osd.61 up 1
66 2.73 osd.66 up 1
71 2.73 osd.71 up 1
76 2.73 osd.76 up 1
81 2.73 osd.81 up 1
86 2.73 osd.86 up 1
91 2.73 osd.91 up 1
96 2.73 osd.96 up 1
-4 54.6 host soi-ceph3
2 2.73 osd.2 up 1
7 2.73 osd.7 up 1
12 2.73 osd.12 up 1
17 2.73 osd.17 up 1
22 2.73 osd.22 up 1
27 2.73 osd.27 up 1
32 2.73 osd.32 up 1
37 2.73 osd.37 up 1
42 2.73 osd.42 up 1
47 2.73 osd.47 up 1
52 2.73 osd.52 up 1
57 2.73 osd.57 up 1
62 2.73 osd.62 up 1
67 2.73 osd.67 up 1
72 2.73 osd.72 up 1
77 2.73 osd.77 up 1
82 2.73 osd.82 up 1
87 2.73 osd.87 up 1
92 2.73 osd.92 up 1
97 2.73 osd.97 up 1
-5 0 host soi-ceph4
-6 0 host soi-ceph5
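(The tree output above shows the hierarchy and weights; the compiled crushmap itself, with the rules, can be dumped and decompiled like so if it turns out to matter:)

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt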
I find it strange that 1023 PGs are undersized when only one OSD failed.

Bob

On Thu, Mar 31, 2016 at 9:27 AM, Calvin Morrow <calvin.morrow@xxxxxxxxx> wrote:

On Wed, Mar 30, 2016 at 5:24 PM Christian Balzer <chibi@xxxxxxx> wrote:

On Wed, 30 Mar 2016 15:50:07 +0000 Calvin Morrow wrote:
> On Wed, Mar 30, 2016 at 1:27 AM Christian Balzer <chibi@xxxxxxx> wrote:
>
> >
> > Hello,
> >
> > On Tue, 29 Mar 2016 18:10:33 +0000 Calvin Morrow wrote:
> >
> > > Ceph cluster with 60 OSDs, Giant 0.87.2. One of the OSDs failed due
> > > to a hardware error, however after normal recovery it seems stuck
> > > with one active+undersized+degraded+inconsistent pg.
> > >
> > Any reason (other than inertia, which I understand very well) you're
> > running a non-LTS version that last saw bug fixes a year ago?
> > You may very well be facing a bug that has long been fixed even in
> > Firefly, let alone Hammer.
> >
> I know we discussed Hammer several times, and I don't remember the exact
> reason we held off. Other than that, inertia is probably the best
> answer I have.
>
Fair enough.
I just seem to remember similar scenarios where recovery got stuck/hung
and thus would assume it was fixed in newer versions.
If you google for "ceph recovery stuck" you find another potential
solution behind the RH paywall and this:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043894.html
That would have been my next suggestion anyway; Ceph OSDs seem to take
well to the 'IT Crowd' mantra of "Have you tried turning it off and on
again?". ^o^

Yeah, unfortunately that was something I tried before reaching out on the mailing list. It didn't seem to change anything.

In particular, I was noticing that my "ceph pg repair 12.28a" command never seemed to be acknowledged by the OSD. I was hoping for some sort of log message, even an 'ERR', but while I saw messages about other pg scrubs, nothing showed up for the problem PG. I tried before and after an OSD restart (both OSDs) without any apparent change.
> >
> > If so, hopefully one of the devs remembering it can pipe up.
> >
> > > I haven't been able to get repair to happen using "ceph pg repair
> > > 12.28a"; I can see the activity logged in the mon logs, however the
> > > repair doesn't actually seem to happen in any of the actual osd logs.
> > >
> > > I tried following Sebastien's instructions for manually locating the
> > > inconsistent object (
> > > http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> > ),
> > > however the md5sum from the objects both match, so I'm not quite
> > > sure how to proceed.
> > >
> > Rolling a dice? ^o^
> > Do they have similar (identical really) timestamps as well?
> >
> Yes, timestamps are identical.
>
Unsurprisingly.
> >
> > > Any ideas on how to return to a healthy cluster?
> > >
> > > [root@soi-ceph2 ceph]# ceph status
> > > cluster 6cc00165-4956-4947-8605-53ba51acd42b
> > > health HEALTH_ERR 1023 pgs degraded; 1 pgs inconsistent; 1023
> > > pgs stuck degraded; 1099 pgs stuck unclean; 1023 pgs stuck
> > > undersized; 1023 pgs undersized; recovery 132091/23742762 objects
> > > degraded (0.556%); 7745/23742762 objects misplaced (0.033%); 1 scrub
> > > errors monmap e5: 3 mons at {soi-ceph1=
> > > 10.2.2.11:6789/0,soi-ceph2=10.2.2.12:6789/0,soi-ceph3=10.2.2.13:6789/0},
> > > election epoch 4132, quorum 0,1,2 soi-ceph1,soi-ceph2,soi-ceph3
> > > osdmap e41120: 60 osds: 59 up, 59 in
> > > pgmap v37432002: 61440 pgs, 15 pools, 30513 GB data, 7728
> > > kobjects 91295 GB used, 73500 GB / 160 TB avail
> > > 132091/23742762 objects degraded (0.556%); 7745/23742762
> > > objects misplaced (0.033%)
> > > 60341 active+clean
> > > 76 active+remapped
> > > 1022 active+undersized+degraded
> > > 1 active+undersized+degraded+inconsistent
> > > client io 44548 B/s rd, 19591 kB/s wr, 1095 op/s
> > >
> > What's confusing to me in this picture are the stuck and unclean PGs as
> > well as the degraded objects; it seems that recovery has stopped?
> >
> Yeah ... recovery essentially halted. I'm sure it's no accident that
> there are exactly 1023 (1024-1) unhealthy pgs.
>
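(A couple of standard checks for confirming where recovery is actually hung:)

  ceph pg dump_stuck unclean      # list the PGs the monitors consider stuck
  ceph pg 12.28a query            # shows up/acting sets and the recovery state machine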
> >
> > Something else that suggests a bug, or at least a stuck OSD.
> >
> > > [root@soi-ceph2 ceph]# ceph health detail | grep inconsistent
> > > pg 12.28a is stuck unclean for 126274.215835, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck undersized for 3499.099747, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is stuck degraded for 3499.107051, current state
> > > active+undersized+degraded+inconsistent, last acting [36,52]
> > > pg 12.28a is active+undersized+degraded+inconsistent, acting [36,52]
> > >
> > > [root@soi-ceph2 ceph]# zgrep 'ERR' *.gz
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:00:43.568221 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a shard 20: soid
> > >
> > c5cf428a/default.64340.11__shadow_.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO_106/head//12
> > > candidate had a read error, digest 2029411064 != known digest
> > > 2692480864
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > That's the culprit, google for it. Of course the most promising looking
> > answer is behind the RH pay wall.
> >
> This part is the most confusing for me. To me, this should indicate that
> there was some kind of bitrot on the disk (I'd love for ZFS to be better
> supported here). What I don't understand is that the actual object has
> identical md5sums, timestamps, etc. I don't know if this means there was
> just a transient error that Ceph can't get over, or whether I'm
> mistakenly looking at the wrong object. Maybe something stored in an
> xattr somewhere?
>
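(On the xattr question: with FileStore an object's metadata -- object_info, snapset -- lives in extended attributes on the file, so those can be compared across replicas as well as the data. A sketch only; <object path> stands in for the same file the md5sums below were taken from, and the exact attribute names vary by version:)

  attr -l <object path>                      # list the attribute names
  getfattr -d -m '.*' -e hex <object path>   # dump them for diffing between osd.36 and osd.52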
I could think of more scenarios, not knowing in detail how either that
checksum or md5sum works.
Like one going through the pagecache while the other doesn't.
Or the checksum being corrupted, written out of order, etc.
And transient errors should hopefully respond well to an OSD restart.
Unfortunately not this time.

> >
> > Looks like that disk has an issue, guess you're not seeing this on
> > osd.52, right?
> >
> Correct.
>
> > Check osd.36's SMART status.
> >
> SMART is normal, no errors, all counters seem fine.
>
If there were an actual issue with the HDD, I'd expect to see at least
some Pending or Offline sectors.
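(For reference, the sort of check meant here; /dev/sdX is a placeholder for whatever device backs osd.36:)

  smartctl -A /dev/sdX | egrep -i 'Reallocated|Pending|Offline_Uncorrectable'
  smartctl -l error /dev/sdX      # the drive's own error log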
> >
> > My guess is that you may have to set min_size to 1 and recover osd.35
> > as well, but don't take my word for it.
> >
> Thanks for the suggestion. I'm holding out for the moment in case
> someone else reads this and has an "aha" moment. At the moment, I'm not
> sure if it would be more dangerous to try and blow away the object on
> osd.36 and hope for recovery (with min_size 1) or try a software upgrade
> on an unhealthy cluster (yuck).
>
Well, see above.
And yeah, neither of those two alternatives is particularly alluring.
OTOH, you're looking at just one object versus a whole PG or OSD.
The more I think about it, the more I seem to be convincing myself that your argument about it being a software error seems more likely. That makes the option of setting min_size less appealing, because I have doubts that even ridding myself of that object will be acted on appropriately.

I think I'll look more into previous 'stuck recovery' issues and see how they were handled. If the consensus for those was 'upgrade' even amidst an unhealthy status, we'll probably try that route.

Christian
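(For anyone landing on this thread later: the manual route being weighed is essentially the procedure from the sebastien-han.fr post linked earlier -- stop the primary, move the suspect copy aside, restart, repair. A sketch only; <object path> is a placeholder and the service commands are assumptions:)

  ceph osd set noout
  /etc/init.d/ceph stop osd.36         # or: systemctl stop ceph-osd@36
  ceph-osd -i 36 --flush-journal
  mv <object path on osd.36> /root/backup/
  /etc/init.d/ceph start osd.36        # or: systemctl start ceph-osd@36
  ceph pg repair 12.28a
  ceph osd unset noout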
> >
> > Christian
> >
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970413 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 0 missing, 1
> > > inconsistent objects
> > > ceph-osd.36.log-20160325.gz:2016-03-24 12:01:25.970423 7fe7b2897700
> > > -1 log_channel(default) log [ERR] : 12.28a deep-scrub 1 errors
> > >
> > > [root@soi-ceph2 ceph]# md5sum
> > >
> > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > > \fb57b1f17421377bf2c35809f395e9b9
> > >
> > /var/lib/ceph/osd/ceph-36/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > >
> > > [root@soi-ceph3 ceph]# md5sum
> > >
> > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> > > \fb57b1f17421377bf2c35809f395e9b9
> > >
> > /var/lib/ceph/osd/ceph-52/current/12.28a_head/DIR_A/DIR_8/DIR_2/DIR_4/default.64340.11\\u\\ushadow\\u.VR0pEp1Nea8buLSqa9TGhLFZQ6co3KO\\u106__head_C5CF428A__c
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com