You mean that you never see recovery without crush map removal? That is strange. I see quick recovery in our two small clusters, and even in our production cluster, when a daemon is killed. It is only when an OSD crashes that I don't see recovery in production. Let me ask the ceph-devel community whether this is a known issue or not.

Thanks

On Wed, May 18, 2016 at 9:37 PM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
> Hi Gaurav,
>
> It could be an issue. But I never see crush map removal without recovery.
>
> Best regards,
>
> On Wed, May 18, 2016 at 1:41 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> Is it a known issue and is it expected?
>>
>> When an OSD is marked out, its reweight becomes 0 and the PGs should get remapped, right?
>>
>> I do see recovery after removing it from the crush map.
>>
>> Thanks
>> Gaurav
>>
>> On Wed, May 18, 2016 at 12:08 PM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> > Hi Gaurav,
>> >
>> > Not only marked out; you need to remove it from the crush map to make sure the cluster does auto recovery. It seems that the marked-out OSD still appears in the crush map calculation, so it must be removed manually. You will see that there is a recovery process after you remove the OSD from the crush map.
>> >
>> > Best regards,
>> >
>> > On Tue, May 17, 2016 at 12:49 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>> >>
>> >> Hi Lazuardi,
>> >>
>> >> No, there are no unfound or incomplete PGs.
>> >>
>> >> Replacing the OSDs surely makes the cluster healthy. But the problem should not have occurred in the first place. The cluster should have automatically healed after the OSDs were marked out of the cluster. Otherwise this will be a manual process for us every time a disk fails, which happens very regularly.
>> >>
>> >> Thanks
>> >> Gaurav
>> >>
>> >> On Tue, May 17, 2016 at 11:06 AM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> > Gaurav,
>> >> >
>> >> > Are there any unfound or incomplete PGs? If not, you can remove the OSD (while monitoring the ceph -w and ceph -s output) and then replace it with a good one, one OSD at a time. I have done that successfully.
>> >> >
>> >> > Best regards,
>> >> >
>> >> > On Tue, May 17, 2016 at 12:30 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>> >> >>
>> >> >> I faced the same issue with our production cluster.
>> >> >>
>> >> >>     cluster fac04d85-db48-4564-b821-deebda046261
>> >> >>      health HEALTH_WARN
>> >> >>             658 pgs degraded
>> >> >>             658 pgs stuck degraded
>> >> >>             688 pgs stuck unclean
>> >> >>             658 pgs stuck undersized
>> >> >>             658 pgs undersized
>> >> >>             recovery 3064/1981308 objects degraded (0.155%)
>> >> >>             recovery 124/1981308 objects misplaced (0.006%)
>> >> >>      monmap e11: 11 mons at {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmon6=10.140.208.228:6789/0,dssmonleader1=10.140.208.223:6789/0}
>> >> >>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10 dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
>> >> >>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
>> >> >>       pgmap v2740957: 75680 pgs, 11 pools, 386 GB data, 322 kobjects
>> >> >>             16288 GB used, 14299 TB / 14315 TB avail
>> >> >>             3064/1981308 objects degraded (0.155%)
>> >> >>             124/1981308 objects misplaced (0.006%)
>> >> >>                74992 active+clean
>> >> >>                  658 active+undersized+degraded
>> >> >>                   30 active+remapped
>> >> >>   client io 12394 B/s rd, 17 op/s
>> >> >>
>> >> >> With 12 OSDs down due to H/W failure, and a replication factor of 6, the cluster should have recovered, but it is not recovering.
>> >> >>
>> >> >> When I kill an OSD daemon, it recovers quickly. Any ideas why the PGs are remaining undersized?
>> >> >>
>> >> >> What could be the difference between the two scenarios:
>> >> >>
>> >> >> 1. OSD down due to H/W failure.
>> >> >> 2. OSD daemon killed.
>> >> >>
>> >> >> When I remove the 12 OSDs from the crushmap manually, or do ceph osd crush remove for those OSDs, the cluster recovers just fine.
>> >> >>
>> >> >> Thanks
>> >> >> Gaurav
>> >> >>
>> >> >> On Tue, May 17, 2016 at 2:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >
>> >> >> >> On 14 May 2016 at 12:36, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> Hi Wido,
>> >> >> >>
>> >> >> >> Yes, you are right. After removing the down OSDs, reformatting them and bringing them up again, at least until 75% of the total OSDs were back, my Ceph cluster is healthy again. It seems there is a high probability of data safety if the total of active PGs equals the total PGs and the total of degraded PGs equals the total of undersized PGs, but it is better to check the PGs one by one to make sure there are no incomplete, unfound and/or missing objects.
>> >> >> >>
>> >> >> >> Anyway, why 75%? Can I reduce this value by resizing (adding to) the replica count of the pool?
>> >> >> >>
>> >> >> >
>> >> >> > It completely depends on the CRUSH map how many OSDs have to be added back to allow the cluster to recover.
>> >> >> >
>> >> >> > A CRUSH map has failure domains, which are usually hosts. You have to make sure you have enough 'hosts' online with OSDs for each replica.
>> >> >> >
>> >> >> > So with 3 replicas you need 3 hosts online with OSDs on them.
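As an illustration of that check (these are standard Ceph CLI commands, but the pool name below is just a placeholder), something like this should show how many hosts still have OSDs up and which failure domain the CRUSH rules actually separate replicas across:

    # hosts and the up/down state of their OSDs, one bucket per failure domain
    ceph osd tree

    # the bucket type each rule splits replicas across (look for "type": "host")
    ceph osd crush rule dump

    # replica count of a pool ("mypool" is a placeholder name)
    ceph osd pool get mypool size

If ceph osd tree shows fewer hosts with up OSDs than the pool's size, CRUSH cannot place all the replicas and the PGs stay undersized, which matches the behaviour described above.
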
>> >> >> >
>> >> >> > You can lower the replica count of a pool (size), but that makes it more vulnerable to data loss.
>> >> >> >
>> >> >> > Wido
>> >> >> >
>> >> >> >> Best regards,
>> >> >> >>
>> >> >> >> On Fri, May 13, 2016 at 5:04 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >> >
>> >> >> >> > > On 13 May 2016 at 11:55, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >> > >
>> >> >> >> > > Hi Wido,
>> >> >> >> > >
>> >> >> >> > > The status is the same after 24 hours of running. It seems that the status will not go to fully active+clean until all down OSDs are back again. The only way to make the down OSDs come back is reformatting them, or replacing them if the HDDs have a hardware issue. Do you think that is a safe way to do it?
>> >> >> >> > >
>> >> >> >> >
>> >> >> >> > Ah, you are probably lacking enough replicas to make the recovery proceed.
>> >> >> >> >
>> >> >> >> > If that is needed I would do this OSD by OSD. Your crushmap will probably tell you which OSDs you need to bring back before it works again.
>> >> >> >> >
>> >> >> >> > Wido
>> >> >> >> >
>> >> >> >> > > Best regards,
>> >> >> >> > >
>> >> >> >> > > On Fri, May 13, 2016 at 4:44 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >> > > >
>> >> >> >> > > > > On 13 May 2016 at 11:34, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >> > > > >
>> >> >> >> > > > > Hi,
>> >> >> >> > > > >
>> >> >> >> > > > > After the disaster and restarting for automatic recovery, I found the following ceph status. Some OSDs cannot be restarted due to file system corruption (it seems that xfs is fragile).
>> >> >> >> > > > >
>> >> >> >> > > > > [root@management-b ~]# ceph status
>> >> >> >> > > > >     cluster 3810e9eb-9ece-4804-8c56-b986e7bb5627
>> >> >> >> > > > >      health HEALTH_WARN
>> >> >> >> > > > >             209 pgs degraded
>> >> >> >> > > > >             209 pgs stuck degraded
>> >> >> >> > > > >             334 pgs stuck unclean
>> >> >> >> > > > >             209 pgs stuck undersized
>> >> >> >> > > > >             209 pgs undersized
>> >> >> >> > > > >             recovery 5354/77810 objects degraded (6.881%)
>> >> >> >> > > > >             recovery 1105/77810 objects misplaced (1.420%)
>> >> >> >> > > > >      monmap e1: 3 mons at {management-a=10.255.102.1:6789/0,management-b=10.255.102.2:6789/0,management-c=10.255.102.3:6789/0}
>> >> >> >> > > > >             election epoch 2308, quorum 0,1,2 management-a,management-b,management-c
>> >> >> >> > > > >      osdmap e25037: 96 osds: 49 up, 49 in; 125 remapped pgs
>> >> >> >> > > > >             flags sortbitwise
>> >> >> >> > > > >       pgmap v9024253: 2560 pgs, 5 pools, 291 GB data, 38905 objects
>> >> >> >> > > > >             678 GB used, 90444 GB / 91123 GB avail
>> >> >> >> > > > >             5354/77810 objects degraded (6.881%)
>> >> >> >> > > > >             1105/77810 objects misplaced (1.420%)
>> >> >> >> > > > >                 2226 active+clean
>> >> >> >> > > > >                  209 active+undersized+degraded
>> >> >> >> > > > >                  125 active+remapped
>> >> >> >> > > > >   client io 0 B/s rd, 282 kB/s wr, 10 op/s
>> >> >> >> > > > >
>> >> >> >> > > > > Since the total of active PGs equals the total PGs, and the total of degraded PGs equals the total of undersized PGs, does it mean that all PGs have at least one good replica, so I can just mark lost or remove the down OSDs, reformat them and then restart them if there is no hardware issue with the HDDs? Which PG status should I pay more attention to because of the possibility of lost objects, degraded or undersized?
>> >> >> >> > > > >
>> >> >> >> > > >
>> >> >> >> > > > Yes. Your system is not reporting any inactive, unfound or stale PGs, so that is good news.
>> >> >> >> > > >
>> >> >> >> > > > However, I recommend that you wait for the system to become fully active+clean before you start removing any OSDs or formatting hard drives. Better safe than sorry.
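A quick way to double-check that nothing is inactive, unfound or stale before touching any hardware (just a sketch; the PG id below is a placeholder):

    # summary of problematic PGs and any unfound objects
    ceph health detail

    # PGs stuck in a bad state, by category
    ceph pg dump_stuck inactive
    ceph pg dump_stuck stale
    ceph pg dump_stuck unclean

    # drill into a single PG if anything looks suspicious
    ceph pg 1.2f query

    # then watch recovery progress
    ceph -w
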
>> >> >> >> > > >
>> >> >> >> > > > Wido
>> >> >> >> > > >
>> >> >> >> > > > > Best regards,

--
Gaurav Bafna
9540631400
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
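For completeness, a minimal sketch of the manual removal/replacement sequence discussed in this thread, assuming the failed disk is osd.12 (the ID is a placeholder) and that the PG checks above show nothing incomplete or unfound:

    # mark the OSD out; its CRUSH weight stays, but its reweight drops to 0
    ceph osd out 12

    # remove it from the crush map so CRUSH stops mapping PGs to it
    ceph osd crush remove osd.12

    # drop its key and remove it from the OSD map
    ceph auth del osd.12
    ceph osd rm 12

    # watch recovery until everything is active+clean again
    ceph -w

As suggested earlier in the thread, doing this one OSD at a time and checking ceph -s between steps is the safer approach.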