Re: Ceph Recovery

Hi Lazuardi

No, there are no unfound or incomplete PGs.

Replacing the OSDs does bring the cluster back to health, but the problem
should not have occurred in the first place. The cluster should have
healed automatically after the OSDs were marked out of the cluster.
Otherwise this becomes a manual process for us every time a disk fails,
which happens very regularly.
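
For context, the usual manual replacement looks roughly like the sketch
below (standard ceph CLI; the OSD id is a placeholder and the re-creation
step depends on the deployment tooling):

    # take the failed OSD out and remove it from CRUSH, auth and the osdmap
    ceph osd out <id>
    ceph osd crush remove osd.<id>
    ceph auth del osd.<id>
    ceph osd rm <id>

    # after swapping the disk, recreate the OSD (e.g. with ceph-disk or
    # ceph-deploy) and watch backfill with ceph -s / ceph -w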

Thanks
Gaurav

On Tue, May 17, 2016 at 11:06 AM, Lazuardi Nasution
<mrxlazuardin@xxxxxxxxx> wrote:
> Gaurav,
>
> Are there any unfound or incomplete PGs? If not, you can remove the down
> OSDs (while monitoring the ceph -w and ceph -s output) and replace them
> with good ones, one OSD at a time. I have done that successfully.
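>
> A minimal sketch of how one might check for unfound or incomplete PGs
> before pulling an OSD (standard ceph CLI; <pgid> is a placeholder):
>
>     ceph health detail | grep -Ei 'incomplete|unfound'
>     ceph pg dump_stuck inactive
>     ceph pg <pgid> query          # inspect a suspicious PG in detail
>     ceph pg <pgid> list_missing   # list unfound objects, if any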
>
> Best regards,
>
> On Tue, May 17, 2016 at 12:30 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> I faced the same issue with our production cluster:
>>
>>     cluster fac04d85-db48-4564-b821-deebda046261
>>      health HEALTH_WARN
>>             658 pgs degraded
>>             658 pgs stuck degraded
>>             688 pgs stuck unclean
>>             658 pgs stuck undersized
>>             658 pgs undersized
>>             recovery 3064/1981308 objects degraded (0.155%)
>>             recovery 124/1981308 objects misplaced (0.006%)
>>      monmap e11: 11 mons at
>> {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmonleader1=10.140.208.223:6789/0}
>>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10
>> dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
>>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
>>       pgmap v2740957: 75680 pgs, 11 pools, 386 GB data, 322 kobjects
>>             16288 GB used, 14299 TB / 14315 TB avail
>>             3064/1981308 objects degraded (0.155%)
>>             124/1981308 objects misplaced (0.006%)
>>                74992 active+clean
>>                  658 active+undersized+degraded
>>                   30 active+remapped
>>   client io 12394 B/s rd, 17 op/s
>>
>> With 12 OSDs down due to hardware failure and a replication factor of
>> 6, the cluster should have recovered, but it is not recovering.
>>
>> When I kill an OSD daemon, it recovers quickly. Any idea why the PGs
>> remain undersized?
>>
>> What could be the difference between these two scenarios:
>>
>> 1. OSD down due to hardware failure.
>> 2. OSD daemon killed.
>>
>> When I remove the 12 OSDs from the CRUSH map manually, or run ceph osd
>> crush remove for those OSDs, the cluster recovers just fine.
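>>
>> For reference, a minimal sketch of the diagnosis and the manual fix
>> described above, assuming the standard Jewel-era ceph CLI (the OSD ids
>> are placeholders):
>>
>>     # see which PGs are stuck and which OSDs are down
>>     ceph pg dump_stuck undersized
>>     ceph pg <pgid> query            # compare the "up" and "acting" sets
>>     ceph osd tree | grep down
>>
>>     # removing the failed OSDs from CRUSH lets the affected PGs remap
>>     # and recovery proceed
>>     for id in 11 23 42; do          # hypothetical failed OSD ids
>>         ceph osd crush remove osd.$id
>>     done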
>>
>> Thanks
>> Gaurav
>>
>> On Tue, May 17, 2016 at 2:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >
>> >> On 14 May 2016 at 12:36, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >>
>> >>
>> >> Hi Wido,
>> >>
>> >> Yes, you are right. After removing the down OSDs, reformatting them and
>> >> bringing them up again, at least until 75% of the total OSDs were back,
>> >> my Ceph cluster is healthy again. It seems there is a high probability
>> >> of data safety if the number of active PGs equals the total number of
>> >> PGs and the number of degraded PGs equals the number of undersized PGs,
>> >> but it is better to check the PGs one by one to make sure there are no
>> >> incomplete, unfound and/or missing objects.
>> >>
>> >> Anyway, why 75%? Can I reduce this threshold by increasing the replica
>> >> count (size) of the pool?
>> >>
>> >
>> > How many OSDs have to be added back before the cluster can recover
>> > depends entirely on the CRUSH map.
>> >
>> > A CRUSH map has failure domains, which are usually hosts. You have to
>> > make sure you have enough hosts online with OSDs to place each replica.
>> >
>> > So with 3 replicas you need 3 hosts online with OSDs on them.
>> >
>> > You can lower the replica count (size) of a pool, but that makes it more
>> > vulnerable to data loss.
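>> >
>> > A minimal sketch of that, assuming a pool named 'mypool' (the name is a
>> > placeholder):
>> >
>> >     ceph osd pool get mypool size       # current replica count
>> >     ceph osd pool set mypool size 2     # lower it; riskier
>> >     ceph osd pool get mypool min_size   # keep min_size consistent with it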
>> >
>> > Wido
>> >
>> >> Best regards,
>> >>
>> >> On Fri, May 13, 2016 at 5:04 PM, Wido den Hollander <wido@xxxxxxxx>
>> >> wrote:
>> >>
>> >> >
>> >> > > On 13 May 2016 at 11:55, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> > >
>> >> > >
>> >> > > Hi Wido,
>> >> > >
>> >> > > The status is the same after 24 hours of running. It seems the
>> >> > > status will not go to fully active+clean until all the down OSDs are
>> >> > > back again. The only way to bring the down OSDs back is to reformat
>> >> > > them, or to replace them if the HDDs have a hardware issue. Do you
>> >> > > think that is a safe way to proceed?
>> >> > >
>> >> >
>> >> > Ah, you are probably lacking enough replicas for the recovery to
>> >> > proceed.
>> >> >
>> >> > If that is needed, I would do this OSD by OSD. Your CRUSH map will
>> >> > probably tell you which OSDs you need to bring back before recovery
>> >> > works again.
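>> >> >
>> >> > A minimal sketch of how to inspect that (standard ceph tooling;
>> >> > crush.bin and crush.txt are just scratch file names):
>> >> >
>> >> >     ceph osd tree                         # shows which OSDs are down/out
>> >> >     ceph osd getcrushmap -o crush.bin
>> >> >     crushtool -d crush.bin -o crush.txt   # decompile to readable text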
>> >> >
>> >> > Wido
>> >> >
>> >> > > Best regards,
>> >> > >
>> >> > > On Fri, May 13, 2016 at 4:44 PM, Wido den Hollander <wido@xxxxxxxx>
>> >> > wrote:
>> >> > >
>> >> > > >
>> >> > > > > On 13 May 2016 at 11:34, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> > > > >
>> >> > > > >
>> >> > > > > Hi,
>> >> > > > >
>> >> > > > > After a disaster and restarting for automatic recovery, I found
>> >> > > > > the following ceph status. Some OSDs cannot be restarted due to
>> >> > > > > file system corruption (it seems that XFS is fragile).
>> >> > > > >
>> >> > > > > [root@management-b ~]# ceph status
>> >> > > > >     cluster 3810e9eb-9ece-4804-8c56-b986e7bb5627
>> >> > > > >      health HEALTH_WARN
>> >> > > > >             209 pgs degraded
>> >> > > > >             209 pgs stuck degraded
>> >> > > > >             334 pgs stuck unclean
>> >> > > > >             209 pgs stuck undersized
>> >> > > > >             209 pgs undersized
>> >> > > > >             recovery 5354/77810 objects degraded (6.881%)
>> >> > > > >             recovery 1105/77810 objects misplaced (1.420%)
>> >> > > > >      monmap e1: 3 mons at {management-a=
>> >> > > > > 10.255.102.1:6789/0,management-b=10.255.102.2:6789/0,management-c=10.255.102.3:6789/0}
>> >> > > > >             election epoch 2308, quorum 0,1,2
>> >> > > > > management-a,management-b,management-c
>> >> > > > >      osdmap e25037: 96 osds: 49 up, 49 in; 125 remapped pgs
>> >> > > > >             flags sortbitwise
>> >> > > > >       pgmap v9024253: 2560 pgs, 5 pools, 291 GB data, 38905
>> >> > > > > objects
>> >> > > > >             678 GB used, 90444 GB / 91123 GB avail
>> >> > > > >             5354/77810 objects degraded (6.881%)
>> >> > > > >             1105/77810 objects misplaced (1.420%)
>> >> > > > >                 2226 active+clean
>> >> > > > >                  209 active+undersized+degraded
>> >> > > > >                  125 active+remapped
>> >> > > > >   client io 0 B/s rd, 282 kB/s wr, 10 op/s
>> >> > > > >
>> >> > > > > Since the number of active PGs equals the total number of PGs,
>> >> > > > > and the number of degraded PGs equals the number of undersized
>> >> > > > > PGs, does that mean all PGs have at least one good replica, so
>> >> > > > > that I can just mark the down OSDs lost or remove them, reformat
>> >> > > > > them and restart them again if there is no hardware issue with the
>> >> > > > > HDDs? Which PG state should I pay more attention to with respect
>> >> > > > > to possibly lost objects, degraded or undersized?
>> >> > > > >
>> >> > > >
>> >> > > > Yes. Your system is not reporting any inactive, unfound or stale
>> >> > > > PGs, so that is good news.
>> >> > > >
>> >> > > > However, I recommend that you wait for the system to become fully
>> >> > > > active+clean before you start removing any OSDs or formatting hard
>> >> > > > drives. Better safe than sorry.
>> >> > > >
>> >> > > > Wido
>> >> > > >
>> >> > > > > Best regards,
>> >> > > >
>> >> >
>>
>>
>>
>> --
>> Gaurav Bafna
>> 9540631400
>
>



-- 
Gaurav Bafna
9540631400
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


