You mean that you never see recovery without crush map removal? That is strange. I see quick recovery in our two small clusters, and even in our production cluster, when a daemon is killed. It is only when an OSD crashes that I don't see recovery in production. Let me ask the ceph-devel community whether this is a known issue or not.

Thanks

On Wed, May 18, 2016 at 9:37 PM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
> Hi Gaurav,
>
> It could be an issue. But I never see crush map removal without recovery.
>
> Best regards,
>
> On Wed, May 18, 2016 at 1:41 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>>
>> Is it a known issue and is it expected?
>>
>> When an OSD is marked out, its reweight becomes 0 and the PGs should get remapped, right?
>>
>> I do see recovery after removing it from the crush map.
>>
>> Thanks
>> Gaurav
>>
>> On Wed, May 18, 2016 at 12:08 PM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> > Hi Gaurav,
>> >
>> > Not only marked out; you need to remove it from the crush map to make sure the cluster does auto recovery. It seems that the marked-out OSD still appears in the crush map calculation, so it must be removed manually. You will see that there is a recovery process after you remove the OSD from the crush map.
>> >
>> > Best regards,
>> >
>> > On Tue, May 17, 2016 at 12:49 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>> >>
>> >> Hi Lazuardi,
>> >>
>> >> No, there are no unfound or incomplete PGs.
>> >>
>> >> Replacing the OSDs surely makes the cluster healthy. But the problem should not have occurred in the first place. The cluster should have automatically healed after the OSDs were marked out of the cluster. Otherwise this will be a manual process for us every time a disk fails, which happens very regularly.
>> >>
>> >> Thanks
>> >> Gaurav
>> >>
>> >> On Tue, May 17, 2016 at 11:06 AM, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> > Gaurav,
>> >> >
>> >> > Are there any unfound or incomplete PGs? If not, you can remove the OSD (while monitoring the ceph -w and ceph -s output) and then replace it with a good one, one OSD at a time. I have done that successfully.
>> >> >
>> >> > Best regards,
>> >> >
>> >> > On Tue, May 17, 2016 at 12:30 PM, Gaurav Bafna <bafnag@xxxxxxxxx> wrote:
>> >> >>
>> >> >> I faced the same issue with our production cluster.
>> >> >>
>> >> >>     cluster fac04d85-db48-4564-b821-deebda046261
>> >> >>      health HEALTH_WARN
>> >> >>             658 pgs degraded
>> >> >>             658 pgs stuck degraded
>> >> >>             688 pgs stuck unclean
>> >> >>             658 pgs stuck undersized
>> >> >>             658 pgs undersized
>> >> >>             recovery 3064/1981308 objects degraded (0.155%)
>> >> >>             recovery 124/1981308 objects misplaced (0.006%)
>> >> >>      monmap e11: 11 mons at {dssmon2=10.140.208.224:6789/0,dssmon3=10.140.208.225:6789/0,dssmon31=10.135.38.141:6789/0,dssmon32=10.135.38.142:6789/0,dssmon33=10.135.38.143:6789/0,dssmon34=10.135.38.144:6789/0,dssmon35=10.135.38.145:6789/0,dssmon4=10.140.208.226:6789/0,dssmon5=10.140.208.227:6789/0,dssmon6=10.140.208.228:6789/0,dssmonleader1=10.140.208.223:6789/0}
>> >> >>             election epoch 792, quorum 0,1,2,3,4,5,6,7,8,9,10 dssmon31,dssmon32,dssmon33,dssmon34,dssmon35,dssmonleader1,dssmon2,dssmon3,dssmon4,dssmon5,dssmon6
>> >> >>      osdmap e8778: 2774 osds: 2746 up, 2746 in; 30 remapped pgs
>> >> >>       pgmap v2740957: 75680 pgs, 11 pools, 386 GB data, 322 kobjects
>> >> >>             16288 GB used, 14299 TB / 14315 TB avail
>> >> >>             3064/1981308 objects degraded (0.155%)
>> >> >>             124/1981308 objects misplaced (0.006%)
>> >> >>                74992 active+clean
>> >> >>                  658 active+undersized+degraded
>> >> >>                   30 active+remapped
>> >> >>   client io 12394 B/s rd, 17 op/s
>> >> >>
>> >> >> With 12 OSDs down due to H/W failure, and a replication factor of 6, the cluster should have recovered, but it is not recovering.
>> >> >>
>> >> >> When I kill an OSD daemon, it recovers quickly. Any ideas why the PGs are remaining undersized?
>> >> >>
>> >> >> What could be the difference between the two scenarios:
>> >> >>
>> >> >> 1. OSD down due to H/W failure.
>> >> >> 2. OSD daemon killed.
>> >> >>
>> >> >> When I remove the 12 OSDs from the crushmap manually, or do ceph osd crush remove for those OSDs, the cluster recovers just fine.
>> >> >>
>> >> >> Thanks
>> >> >> Gaurav
>> >> >>
>> >> >> On Tue, May 17, 2016 at 2:08 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >
>> >> >> >> On 14 May 2016 at 12:36, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> Hi Wido,
>> >> >> >>
>> >> >> >> Yes, you are right. After removing the down OSDs, reformatting them and bringing them up again, at least until 75% of the total OSDs were back, my Ceph cluster is healthy again. It seems there is a high probability of data safety if the total of active PGs equals the total PGs and the total of degraded PGs equals the total of undersized PGs, but it is better to check the PGs one by one to make sure there are no incomplete, unfound and/or missing objects.
>> >> >> >>
>> >> >> >> Anyway, why 75%? Can I reduce this value by resizing (adding to) the replica count of the pool?
>> >> >> >>
>> >> >> >
>> >> >> > It completely depends on the CRUSH map how many OSDs have to be added back to allow the cluster to recover.
>> >> >> >
>> >> >> > A CRUSH map has failure domains, which are usually hosts. You have to make sure you have enough 'hosts' online with OSDs for each replica.
>> >> >> >
>> >> >> > So with 3 replicas you need 3 hosts online with OSDs on them.
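As an illustration of that check (these are standard Ceph CLI commands, but the pool name below is just a placeholder), something like this should show how many hosts still have OSDs up and which failure domain the CRUSH rules actually separate replicas across:

    # hosts and the up/down state of their OSDs, one bucket per failure domain
    ceph osd tree

    # the bucket type each rule splits replicas across (look for "type": "host")
    ceph osd crush rule dump

    # replica count of a pool ("mypool" is a placeholder name)
    ceph osd pool get mypool size

If ceph osd tree shows fewer hosts with up OSDs than the pool's size, CRUSH cannot place all the replicas and the PGs stay undersized, which matches the behaviour described above.
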
>> >> >> >
>> >> >> > You can lower the replica count of a pool (size), but that makes it more vulnerable to data loss.
>> >> >> >
>> >> >> > Wido
>> >> >> >
>> >> >> >> Best regards,
>> >> >> >>
>> >> >> >> On Fri, May 13, 2016 at 5:04 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >> >
>> >> >> >> > > On 13 May 2016 at 11:55, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >> > >
>> >> >> >> > > Hi Wido,
>> >> >> >> > >
>> >> >> >> > > The status is the same after 24 hours of running. It seems that the status will not go to fully active+clean until all down OSDs are back again. The only way to make the down OSDs come back is reformatting them, or replacing them if the HDDs have a hardware issue. Do you think that is a safe way to do it?
>> >> >> >> > >
>> >> >> >> >
>> >> >> >> > Ah, you are probably lacking enough replicas to make the recovery proceed.
>> >> >> >> >
>> >> >> >> > If that is needed I would do this OSD by OSD. Your crushmap will probably tell you which OSDs you need to bring back before it works again.
>> >> >> >> >
>> >> >> >> > Wido
>> >> >> >> >
>> >> >> >> > > Best regards,
>> >> >> >> > >
>> >> >> >> > > On Fri, May 13, 2016 at 4:44 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >> >> >> > > >
>> >> >> >> > > > > On 13 May 2016 at 11:34, Lazuardi Nasution <mrxlazuardin@xxxxxxxxx> wrote:
>> >> >> >> > > > >
>> >> >> >> > > > > Hi,
>> >> >> >> > > > >
>> >> >> >> > > > > After the disaster and restarting for automatic recovery, I found the following ceph status. Some OSDs cannot be restarted due to file system corruption (it seems that xfs is fragile).
>> >> >> >> > > > >
>> >> >> >> > > > > [root@management-b ~]# ceph status
>> >> >> >> > > > >     cluster 3810e9eb-9ece-4804-8c56-b986e7bb5627
>> >> >> >> > > > >      health HEALTH_WARN
>> >> >> >> > > > >             209 pgs degraded
>> >> >> >> > > > >             209 pgs stuck degraded
>> >> >> >> > > > >             334 pgs stuck unclean
>> >> >> >> > > > >             209 pgs stuck undersized
>> >> >> >> > > > >             209 pgs undersized
>> >> >> >> > > > >             recovery 5354/77810 objects degraded (6.881%)
>> >> >> >> > > > >             recovery 1105/77810 objects misplaced (1.420%)
>> >> >> >> > > > >      monmap e1: 3 mons at {management-a=10.255.102.1:6789/0,management-b=10.255.102.2:6789/0,management-c=10.255.102.3:6789/0}
>> >> >> >> > > > >             election epoch 2308, quorum 0,1,2 management-a,management-b,management-c
>> >> >> >> > > > >      osdmap e25037: 96 osds: 49 up, 49 in; 125 remapped pgs
>> >> >> >> > > > >             flags sortbitwise
>> >> >> >> > > > >       pgmap v9024253: 2560 pgs, 5 pools, 291 GB data, 38905 objects
>> >> >> >> > > > >             678 GB used, 90444 GB / 91123 GB avail
>> >> >> >> > > > >             5354/77810 objects degraded (6.881%)
>> >> >> >> > > > >             1105/77810 objects misplaced (1.420%)
>> >> >> >> > > > >                 2226 active+clean
>> >> >> >> > > > >                  209 active+undersized+degraded
>> >> >> >> > > > >                  125 active+remapped
>> >> >> >> > > > >   client io 0 B/s rd, 282 kB/s wr, 10 op/s
>> >> >> >> > > > >
>> >> >> >> > > > > Since the total of active PGs equals the total PGs, and the total of degraded PGs equals the total of undersized PGs, does it mean that all PGs have at least one good replica, so I can just mark lost or remove the down OSDs, reformat them and then restart them if there is no hardware issue with the HDDs? Which PG status should I pay more attention to because of the possibility of lost objects, degraded or undersized?
>> >> >> >> > > > >
>> >> >> >> > > >
>> >> >> >> > > > Yes. Your system is not reporting any inactive, unfound or stale PGs, so that is good news.
>> >> >> >> > > >
>> >> >> >> > > > However, I recommend that you wait for the system to become fully active+clean before you start removing any OSDs or formatting hard drives. Better safe than sorry.
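A quick way to double-check that nothing is inactive, unfound or stale before touching any hardware (just a sketch; the PG id below is a placeholder):

    # summary of problematic PGs and any unfound objects
    ceph health detail

    # PGs stuck in a bad state, by category
    ceph pg dump_stuck inactive
    ceph pg dump_stuck stale
    ceph pg dump_stuck unclean

    # drill into a single PG if anything looks suspicious
    ceph pg 1.2f query

    # then watch recovery progress
    ceph -w
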
>> >> >> >> > > >
>> >> >> >> > > > Wido
>> >> >> >> > > >
>> >> >> >> > > > > Best regards,

--
Gaurav Bafna
9540631400
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
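For completeness, a minimal sketch of the manual removal/replacement sequence discussed in this thread, assuming the failed disk is osd.12 (the ID is a placeholder) and that the PG checks above show nothing incomplete or unfound:

    # mark the OSD out; its CRUSH weight stays, but its reweight drops to 0
    ceph osd out 12

    # remove it from the crush map so CRUSH stops mapping PGs to it
    ceph osd crush remove osd.12

    # drop its key and remove it from the OSD map
    ceph auth del osd.12
    ceph osd rm 12

    # watch recovery until everything is active+clean again
    ceph -w

As suggested earlier in the thread, doing this one OSD at a time and checking ceph -s between steps is the safer approach.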