Re: Ceph Recovery

Hi Gaurav,

It could be an issue, but I have never seen a CRUSH map removal that did not trigger recovery.

Best regards,

On Wed, May 18, 2016 at 1:41 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
Is it a known issue, and is it expected?

When an OSD is marked out, the reweight becomes 0 and the PGs should
get remapped, right?

I do see recovery after removing from crush map.
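
One way to check what the cluster still thinks of a marked-out OSD
(osd.12 is just an example id):

    ceph osd tree | grep osd.12   # REWEIGHT drops to 0 on mark-out,
                                  # but the CRUSH weight stays unchanged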

Thanks
Gaurav

On Wed, May 18, 2016 at 12:08 PM, Lazuardi Nasution
<mrxlazuardin@xxxxxxxxx> wrote:
> Hi Gaurav,
>
> Marking it out is not enough; you need to remove it from the CRUSH map to
> make sure the cluster does the auto recovery. It seems that the marked-out
> OSD still appears in the CRUSH map calculation, so it must be removed
> manually. You will see a recovery process after you remove the OSD from
> the CRUSH map.
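>
> For example, the usual removal sequence is something like this (osd.12 is
> just an example id):
>
>     ceph osd crush remove osd.12   # this step triggers the recovery
>     ceph auth del osd.12
>     ceph osd rm 12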
>
> Best regards,
>
> On Tue, May 17, 2016 at 12:49 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> Hi Lazuardi
>>
>> No, there are no unfound or incomplete PGs.
>>
>> Replacing the OSDs certainly restores cluster health, but the problem
>> should not have occurred in the first place. The cluster should have
>> automatically healed after the OSDs were marked out of the cluster.
>> Otherwise this will be a manual process for us every time a disk fails,
>> which happens very regularly.
>>
>> Thanks
>> Gaurav
>>
>> On Tue, May 17, 2016 at 11:06 AM, Lazuardi Nasution
>> <mrxlazuardin@xxxxxxxxx> wrote:
>> > Gaurav,
>> >
>> > Are there any unfound or incomplete PGs? If not, you can remove each
>> > OSD (while monitoring the ceph -w and ceph -s output) and then replace
>> > it with a good one, one OSD at a time. I have done that successfully.
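>> >
>> > A quick pre-check before touching each OSD, as a sketch:
>> >
>> >     ceph health detail | grep -E 'unfound|incomplete'   # should print nothing
>> >     ceph pg dump_stuck unclean                          # list stuck PGs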
>> >
>> > Best regards,
>> >
>> > On Tue, May 17, 2016 at 12:30 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>> >>
>> >> I faced the same issue with our production cluster.
>> >>
>> >>     cluster fac04d85-db48-4564-b821-deebda046261
>> >>      health HEALTH_WARN
>> >>             658 pgs degraded
>> >>             658 pgs stuck degraded
>> >>             688 pgs stuck unclean
>> >>             658 pgs stuck undersized
>> >>             658 pgs undersized
>> >>             recovery 3064/1981308 objects degraded (0.155%)
>> >>             recovery 124/1981308 objects misplaced (0.006%)
>> >>      monmap e11: 11 mons at {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmon6=10.140.208.228:6789/0,dssmonleader1=10.140.208.223:6789/0}
>> >>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10
>> >>             dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
>> >>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
>> >>       pgmap v2740957: 75680 pgs, 11 pools, 386 GB data, 322 kobjects
>> >>             16288 GB used, 14299 TB / 14315 TB avail
>> >>             3064/1981308 objects degraded (0.155%)
>> >>             124/1981308 objects misplaced (0.006%)
>> >>                74992 active+clean
>> >>                  658 active+undersized+degraded
>> >>                   30 active+remapped
>> >>   client io 12394 B/s rd, 17 op/s
>> >>
>> >> With 12 OSDs down due to H/W failure, and a replication factor of 6,
>> >> the cluster should have recovered, but it is not recovering.
>> >>
>> >> When I kill an OSD daemon, it recovers quickly. Any ideas why the PGs
>> >> are remaining undersized?
>> >>
>> >> What could be the difference between the two scenarios:
>> >>
>> >> 1. OSD down due to H/W failure.
>> >> 2. OSD daemon killed.
>> >>
>> >> When I remove the 12 OSDs from the CRUSH map manually, or do ceph osd
>> >> crush remove for those OSDs, the cluster recovers just fine.
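>> >>
>> >> i.e. something along these lines (the osd ids are placeholders for
>> >> our 12 failed ones):
>> >>
>> >>     for id in 101 102 103; do ceph osd crush remove osd.$id; done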
>> >>
>> >> Thanks
>> >> Gaurav
>> >>
>> >> On Tue, May 17, 2016 at 2:08 AM, Wido den Hollander <wido@xxxxxxxx>
>> >> wrote:
>> >> >
>> >> >> On 14 May 2016 at 12:36, Lazuardi Nasution
>> >> >> <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >>
>> >> >>
>> >> >> Hi Wido,
>> >> >>
>> >> >> Yes, you are right. After removing the down OSDs, reformatting them,
>> >> >> and bringing them up again, at least until 75% of the total OSDs were
>> >> >> back, my Ceph cluster was healthy again. It seems there is a high
>> >> >> probability of data safety if the number of active PGs equals the
>> >> >> total number of PGs and the number of degraded PGs equals the number
>> >> >> of undersized PGs, but it is better to check the PGs one by one to
>> >> >> make sure there are no incomplete, unfound, and/or missing objects.
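>> >> >>
>> >> >> For example (the pg id is just an illustration; take real ids from
>> >> >> the ceph health detail output):
>> >> >>
>> >> >>     ceph health detail | grep -E 'incomplete|unfound'
>> >> >>     ceph pg 3.1a query    # inspect one PG's peering/recovery state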
>> >> >>
>> >> >> Anyway, why 75%? Can I reduce this value by increasing the replica
>> >> >> count (size) of the pool?
>> >> >>
>> >> >
>> >> > How many OSDs have to be added back to allow the cluster to recover
>> >> > depends completely on the CRUSH map.
>> >> >
>> >> > A CRUSH map has failure domains, usually hosts. You have to make sure
>> >> > you have enough 'hosts' online with OSDs to hold each replica.
>> >> >
>> >> > So with 3 replicas you need 3 hosts online with OSDs on there.
>> >> >
>> >> > You can lower the replica count of a pool (size), but that makes it
>> >> > more vulnerable to data loss.
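>> >> >
>> >> > For example (the pool name 'rbd' is only an illustration):
>> >> >
>> >> >     ceph osd pool get rbd size
>> >> >     ceph osd pool set rbd size 2      # fewer replicas, less safety
>> >> >     ceph osd pool get rbd min_size    # I/O stops below this many replicas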
>> >> >
>> >> > Wido
>> >> >
>> >> >> Best regards,
>> >> >>
>> >> >> On Fri, May 13, 2016 at 5:04 PM, Wido den Hollander <wido@xxxxxxxx>
>> >> >> wrote:
>> >> >>
>> >> >> >
>> >> >> > > On 13 May 2016 at 11:55, Lazuardi Nasution
>> >> >> > > <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > > Hi Wido,
>> >> >> > >
>> >> >> > > The status is the same after 24 hours of running. It seems that
>> >> >> > > the status will not go to fully active+clean until all of the
>> >> >> > > down OSDs come back again. The only way to bring the down OSDs
>> >> >> > > back is to reformat them, or replace them if the HDDs have a
>> >> >> > > hardware issue. Do you think that is a safe way to do it?
>> >> >> > >
>> >> >> >
>> >> >> > Ah, you are probably lacking enough replicas to make the recovery
>> >> >> > proceed.
>> >> >> >
>> >> >> > If that is needed, I would do this OSD by OSD. Your CRUSH map will
>> >> >> > probably tell you which OSDs you need to bring back before it
>> >> >> > works again.
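>> >> >> >
>> >> >> > For example, to see which OSDs a stuck PG maps to (the pg id is
>> >> >> > just an example; use real ids from ceph health detail):
>> >> >> >
>> >> >> >     ceph pg map 3.1a     # prints the up and acting OSD sets
>> >> >> >     ceph pg 3.1a query   # shows why the PG is stuck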
>> >> >> >
>> >> >> > Wido
>> >> >> >
>> >> >> > > Best regards,
>> >> >> > >
>> >> >> > > On Fri, May 13, 2016 at 4:44 PM, Wido den Hollander
>> >> >> > > <wido@xxxxxxxx>
>> >> >> > wrote:
>> >> >> > >
>> >> >> > > >
>> >> >> > > > > On 13 May 2016 at 11:34, Lazuardi Nasution
>> >> >> > > > > <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > Hi,
>> >> >> > > > >
>> >> >> > > > > After a disaster and restarting for automatic recovery, I
>> >> >> > > > > found the following ceph status. Some OSDs cannot be restarted
>> >> >> > > > > due to file system corruption (it seems that xfs is fragile).
>> >> >> > > > >
>> >> >> > > > > [root@management-b ~]# ceph status
>> >> >> > > > >     cluster 3810e9eb-9ece-4804-8c56-b986e7bb5627
>> >> >> > > > >      health HEALTH_WARN
>> >> >> > > > >             209 pgs degraded
>> >> >> > > > >             209 pgs stuck degraded
>> >> >> > > > >             334 pgs stuck unclean
>> >> >> > > > >             209 pgs stuck undersized
>> >> >> > > > >             209 pgs undersized
>> >> >> > > > >             recovery 5354/77810 objects degraded (6.881%)
>> >> >> > > > >             recovery 1105/77810 objects misplaced (1.420%)
>> >> >> > > > >      monmap e1: 3 mons at {management-a=10.255.102.1:6789/0,management-b=10.255.102.2:6789/0,management-c=10.255.102.3:6789/0}
>> >> >> > > > >             election epoch 2308, quorum 0,1,2
>> >> >> > > > > management-a,management-b,management-c
>> >> >> > > > >      osdmap e25037: 96 osds: 49 up, 49 in; 125 remapped pgs
>> >> >> > > > >             flags sortbitwise
>> >> >> > > > >       pgmap v9024253: 2560 pgs, 5 pools, 291 GB data, 38905
>> >> >> > > > > objects
>> >> >> > > > >             678 GB used, 90444 GB / 91123 GB avail
>> >> >> > > > >             5354/77810 objects degraded (6.881%)
>> >> >> > > > >             1105/77810 objects misplaced (1.420%)
>> >> >> > > > >                 2226 active+clean
>> >> >> > > > >                  209 active+undersized+degraded
>> >> >> > > > >                  125 active+remapped
>> >> >> > > > >   client io 0 B/s rd, 282 kB/s wr, 10 op/s
>> >> >> > > > >
>> >> >> > > > > Since the number of active PGs equals the total number of
>> >> >> > > > > PGs, and the number of degraded PGs equals the number of
>> >> >> > > > > undersized PGs, does it mean that all PGs have at least one
>> >> >> > > > > good replica, so I can just mark the down OSDs lost or remove
>> >> >> > > > > them, reformat them, and then restart them if there is no
>> >> >> > > > > hardware issue with the HDDs? Which PG status should I pay
>> >> >> > > > > more attention to, degraded or undersized, regarding the
>> >> >> > > > > possibility of lost objects?
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > Yes. Your system is not reporting any inactive, unfound, or
>> >> >> > > > stale PGs, so that is good news.
>> >> >> > > >
>> >> >> > > > However, I recommend that you wait for the system to become
>> >> >> > > > fully active+clean before you start removing any OSDs or
>> >> >> > > > formatting hard drives. Better safe than sorry.
>> >> >> > > >
>> >> >> > > > Wido
>> >> >> > > >
>> >> >> > > > > Best regards,
>> >> >> > > > > _______________________________________________
>> >> >> > > > > ceph-users mailing list
>> >> >> > > > > ceph-users@xxxxxxxxxxxxxx
>> >> >> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> > > >
>> >> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >>
>> >> --
>> >> Gaurav Bafna
>> >> 9540631400
>> >
>> >
>>
>>
>>
>> --
>> Gaurav Bafna
>> 9540631400
>
>



--
Gaurav Bafna
9540631400

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
