RE: No Ceph Recovery : Is it a bug ?

<< chooseleaf_vary_r = 0

Meaning you are not running with Hammer tunables.
We have sometimes seen OSDs stuck and not recovering because of this. Maybe this could be an issue, Sage?
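
If that turns out to be the cause, one way to move to newer tunables (a sketch only, not something I have verified on this cluster; changing tunables triggers data movement and requires clients that support the newer CRUSH features):

  # Switch the whole profile (firefly sets chooseleaf_vary_r=1,
  # hammer additionally allows straw2 buckets); causes rebalancing
  ceph osd crush tunables hammer

  # Or set just that one tunable by editing the decompiled CRUSH map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  #   ...edit crushmap.txt: "tunable chooseleaf_vary_r 1"...
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new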

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Gaurav Bafna
Sent: Thursday, May 19, 2016 3:41 AM
To: Sage Weil
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: No Ceph Recovery : Is it a bug ?

On Thu, May 19, 2016 at 3:34 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 19 May 2016, Gaurav Bafna wrote:
>> Hi Sage & hzwulibin
>>
>> Many thanks for your reply :)
>>
>> Ceph osd tree : http://pastebin.com/3cC8brcF
>>
>> Crushmap : http://pastebin.com/K2BNSHys
>
> Can you paste the output from
>
>  ceph osd crush show-tunables

{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "unknown",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "require_feature_tunables3": 0,
    "has_v2_rules": 0,
    "has_v3_rules": 0,
    "has_v4_buckets": 0
}
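
In case it helps, I can also check the current map offline for mapping failures. A rough sketch (the rule number and replica count below are assumptions and would need to match our pools):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings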


>
>> Since I removed the 12 OSDs from the CRUSH map, there are no more stuck PGs.
>
> Hmm...
>
>> Version: Hammer 0.94.5. Do you think it can just be a reporting
>> issue? The OSDs that went down had around 300 PGs mapped to them, but
>> they were primary for none of those. We have two IDCs. All the
>> primaries are in one IDC only for now.
>>
>> Sage, I did not get your first point. What is not matching? In our
>> cluster there were 2758 running OSDs before the 12-OSD crash. After
>> that event, there were 2746 OSDs left.
>>
>> PG Stat for your reference :   v3733136: 75680 pgs: 75680
>> active+clean; 409 GB data, 16745 GB used, 14299 TB / 14315 TB avail;
>> 443 B/s wr, 0 op/s
>>
>> Also, I have seen this issue 3 times with our cluster. Even when 1
>> OSD goes down in IDC1 or IDC2, some PGs always remain undersized. I am
>> sure this issue will come up once again. So can you tell me what
>> things I should look at when an OSD goes down next time?
>>
>> When I manually stop an OSD daemon in IDC1 or IDC2 by running
>> /etc/init.d/ceph stop osd.x, the cluster recovers nicely. That is
>> what makes the issue so complex.
>
> Here you say restarting an OSD was enough to clear the problem, but
> above you say you also removed the down OSDs from the CRUSH map.  Can
> you be clear about when the PGs stopped being undersized?

I didn't restart the OSD, as the disk had gone bad. When I removed it from the CRUSH map, the PGs stopped being undersized.
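
For reference, what I ran for each of the 12 dead OSDs was roughly the standard removal procedure (a sketch; osd.<id> stands for each failed OSD):

  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>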


>
> I can't tell from this information whether it is a reporting issue or
> whether CRUSH is failing to do a proper mapping on this cluster.
>
> If you can reproduce this, there are a few things to do:
>
> 1) Grab a copy of the OSD map:
>
>   ceph osd getmap -o osdmap
>
> 2) Get a list of undersized pgs:
>
>  ceph pg ls undersized > pgs.txt
>
> 3) query one of the undersized pgs:
>
>  tail pgs.txt
>  ceph pg <one of the pgids> query > query.txt
>
> 4) Share the result with us
>
>  ceph-post-file osdmap pgs.txt query.txt
>
> (or attach it to an email).

Sure. I will do that the next time it occurs.
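
This is a small sketch of what I plan to run to capture everything in one go next time (it simply picks the last PG from the undersized list and uploads the files with ceph-post-file):

  ceph osd getmap -o osdmap
  ceph pg ls undersized > pgs.txt
  pgid=$(tail -1 pgs.txt | awk '{print $1}')
  ceph pg "$pgid" query > query.txt
  ceph-post-file osdmap pgs.txt query.txt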

Very grateful for your help,
Gaurav
>
> Thanks!
> sage
>
>
>
>>
>> It is a big cluster and we have just started. We will need the
>> community's help to make it run smoothly and expand further :)
>>
>> Thanks
>> Gaurav
>>
>> On Thu, May 19, 2016 at 1:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Thu, 19 May 2016, Gaurav Bafna wrote:
>> >> Hi Cephers ,
>> >>
>> >> In our production cluster at Reliance Jio, when an OSD goes
>> >> corrupt and crashes, the cluster remains unhealthy even after 4 hours.
>> >>
>> >>     cluster fac04d85-db48-4564-b821-deebda046261
>> >>      health HEALTH_WARN
>> >>             658 pgs degraded
>> >>             658 pgs stuck degraded
>> >>             688 pgs stuck unclean
>> >>             658 pgs stuck undersized
>> >>             658 pgs undersized
>> >               ^^^ this...
>> >
>> >>             recovery 3064/1981308 objects degraded (0.155%)
>> >>             recovery 124/1981308 objects misplaced (0.006%)
>> >>      monmap e11: 11 mons at
>> >> {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmon6=10.140.208.228:6789/0,dssmonleader1=10.140.208.223:6789/0}
>> >>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10
>> >> dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
>> >>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
>> >                                doesn't match this ^^
>> >
>> > which makes it look like a problem with OSDs reporting PG state to
>> > the mon.  The fact that an OSD restart clears it supports that theory.
>> >
>> > What version is this?  A bunch of the osd -> mon pg reporting code
>> > was recently rewritten (between infernalis and jewel), so the new
>> > code is hopefully more robust.  (OTOH, it is also new, so we may
>> > have missed
>> > something.)
>> >
>> > Nice big cluster!
>> >
>> > sage
>>
>>
>>
>> --
>> Gaurav Bafna
>> 9540631400
>>
>>



--
Gaurav Bafna
9540631400