Re: No Ceph Recovery : Is it a bug ?

On Thu, 19 May 2016, Gaurav Bafna wrote:
> Hi Sage & hzwulibin
> 
> Many thanks for your reply :)
> 
> Ceph osd tree: http://pastebin.com/3cC8brcF
> 
> Crushmap: http://pastebin.com/K2BNSHys

Can you paste the output from

 ceph osd crush show-tunables
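
(For comparison, on a Hammer cluster the output is a small JSON blob 
along these lines; the field names and values below are illustrative 
defaults, not taken from your cluster:

 {
     "choose_local_tries": 0,
     "choose_local_fallback_tries": 0,
     "choose_total_tries": 50,
     "chooseleaf_descend_once": 1,
     "chooseleaf_vary_r": 1,
     "straw_calc_version": 1,
     "profile": "firefly",
     "optimal_tunables": 0,
     "legacy_tunables": 0
 }

choose_total_tries and chooseleaf_vary_r are the interesting ones 
here, since they affect whether CRUSH can find a replacement OSD 
after a failure.)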

> Since I removed the 12 osds from the crush map, there are no more stuck PGs.

Hmm...
 
> Version: Hammer 0.94.5. Do you think it can just be a reporting
> issue? The osds that went down had around 300 PGs mapped to them, out
> of which they were primary for 0. We have two IDCs. All the primaries
> are in one IDC only for now.
> 
> Sage, I did not get your first point. What is not matching? In our
> cluster there were 2758 running osds before the 12-osd crash. After
> that event, there were 2746 osds left.
> 
> PG stat for your reference: v3733136: 75680 pgs: 75680
> active+clean; 409 GB data, 16745 GB used, 14299 TB / 14315 TB avail;
> 443 B/s wr, 0 op/s
> 
> Also, I have seen this issue 3 times with our cluster. Even when 1 osd
> goes down in IDC1 or IDC2, some PGs always remain undersized. I am sure
> this issue will come up once again. So can you tell me what things I
> should look at the next time an OSD goes down?
> 
> When I manually stop an osd daemon in IDC1 or IDC2 by running
> /etc/init.d/ceph stop osd.x, the cluster recovers nicely. That is
> what makes the issue so complex.

Here you say restarting an OSD was enough to clear the problem, but 
above you say you also removed the down OSDs from the CRUSH map.  Can you 
be clear about when the PGs stopped being undersized?

I can't tell from this information whether it is a reporting issue or 
whether CRUSH is failing to do a proper mapping on this cluster.

If you can reproduce this, there are a few things to do:

1) Grab a copy of the OSD map (you can also sanity-check this map 
offline; see the osdmaptool sketch after step 4):

  ceph osd getmap -o osdmap

2) Get a list of undersized pgs:

 ceph pg ls undersized > pgs.txt

3) Query one of the undersized pgs:

 tail pgs.txt
 ceph tell <one of the pgids> query > query.txt

4) Share the result with us

 ceph-post-file osdmap pgs.txt query.txt

(or attach it to an email).
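
If you want to check the CRUSH mapping yourself in the meantime, 
osdmaptool can replay the placement calculation offline against the 
map from step 1. A sketch (substitute one of the pgids from pgs.txt 
for <pgid>):

 osdmaptool osdmap --print | head
 osdmaptool osdmap --test-map-pg <pgid>

If the "up" set it prints has fewer OSDs than the pool's size, CRUSH 
itself is failing to produce a full mapping; if it prints a full set, 
the undersized state is more likely a reporting problem.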

Thanks!
sage



> 
> It is a big cluster and we have just started. We will need the
> community's help to make it run smoothly and expand further :)
> 
> Thanks
> Gaurav
> 
> On Thu, May 19, 2016 at 1:17 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Thu, 19 May 2016, Gaurav Bafna wrote:
> >> Hi Cephers ,
> >>
> >> In our production cluster at Reliance Jio, when an osd goes corrupt
> >> and crashes, the cluster remains unhealthy even after 4 hours.
> >>
> >>     cluster fac04d85-db48-4564-b821-deebda046261
> >>      health HEALTH_WARN
> >>             658 pgs degraded
> >>             658 pgs stuck degraded
> >>             688 pgs stuck unclean
> >>             658 pgs stuck undersized
> >>             658 pgs undersized
> >               ^^^ this...
> >
> >>             recovery 3064/1981308 objects degraded (0.155%)
> >>             recovery 124/1981308 objects misplaced (0.006%)
> >>      monmap e11: 11 mons at
> >> {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmon6=10.140.208.228:6789/0,dssmonleader1=10.140.208.223:6789/0}
> >>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10
> >> dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
> >>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
> >                                doesn't match this ^^
> >
> > which makes it look like a problem with OSDs reporting PG state to the
> > mon.  The fact that an OSD restarts supports that theory.
> >
> > What version is this?  A bunch of the osd -> mon pg reporting code was
> > recently rewritten (between infernalis and jewel), so the new code is
> > hopefully more robust.  (OTOH, it is also new, so we may have missed
> > something.)
> >
> > Nice big cluster!
> >
> > sage
> 
> 
> 
> -- 
> Gaurav Bafna
> 9540631400
> 
> 