ceph -w output:

    health HEALTH_WARN 441 pgs degraded; 441 pgs stuck unclean; recovery
    131518/1036770 objects degraded (12.685%); 4/31 in osds are down;
    noout flag(s) set

2014-09-05 11:36 GMT+08:00 Jason King <chn.kei at gmail.com>:

> Hi,
>
> What's the status of your cluster after the node failure?
>
> Jason
>
>
> 2014-09-04 21:33 GMT+08:00 Christian Balzer <chibi at gol.com>:
>
>>
>> Hello,
>>
>> On Thu, 4 Sep 2014 20:56:31 +0800 Ding Dinghua wrote:
>>
>> Aside from what Loic wrote, why not replace the network controller or,
>> if it is onboard, add a card?
>>
>> > Hi all,
>> > I'm new to ceph, and I apologize if this question has been asked
>> > before.
>> >
>> > I have set up an 8-node ceph cluster. After two months of running,
>> > the network controller of one node broke, so I have to replace that
>> > node with a new one.
>> > I don't want to trigger data migration, since all I want to do is
>> > replace a node, not shrink the cluster and then enlarge it again.
>>
>> Well, you will have (had) data migration unless your cluster was set to
>> noout from the start or had a "mon osd downout subtree limit" set
>> accordingly.
>>
>> > I think the following steps may work:
>> > 1) Set osd_crush_update_on_start to false, so that when an OSD
>> > starts, it won't modify the crushmap and trigger data migration.
>> I think the noin flag might do that trick, too.
>>
>> > 2) Set the noout flag to prevent OSDs from being kicked out of the
>> > cluster, which would trigger data migration.
>> Probably too late at this point...
>>
>> > 3) Mark all OSDs on the broken node down (actually, since the
>> > network controller is broken, these OSDs are already down).
>> And not out?
>>
>> Regards,
>>
>> Christian
>> > 4) Prepare an OSD on the new node, keeping osd_num the same as
>> > the OSD on the broken node:
>> >     ceph-osd -i [osd_num] --osd-data=path1 --mkfs
>> > 5) Start the OSD on the new node; peering and backfilling will
>> > start automatically.
>> > 6) Wait until 5) completes, then repeat 4) and 5) until all OSDs
>> > on the broken node have been moved to the new node.
>> > I have done some tests on my test cluster, and it seems to work,
>> > but I'm not quite sure it's right in theory, so any comments will be
>> > appreciated.
>> > Thanks.
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com          Global OnLine Japan/Fusion Communications
>> http://www.gol.com/

--
Ding Dinghua
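
For reference, the procedure discussed above can be sketched as concrete
commands. This is a minimal sketch only: the OSD id (12) and the data path
(/var/lib/ceph/osd/ceph-12) are hypothetical placeholders, cephx key handling
is not covered in the thread and is omitted here, and the service command
depends on your init system.

    # On an admin node: keep down OSDs from being marked out, so CRUSH does
    # not start remapping data away from the failed node.
    ceph osd set noout

    # In ceph.conf on the replacement node ([osd] section), so a starting OSD
    # does not update its own CRUSH location and trigger data movement:
    #   osd crush update on start = false

    # Mark the failed node's OSDs down if the monitors have not already done so.
    ceph osd down 12

    # On the new node: recreate the OSD data store, reusing the same OSD id
    # as on the broken node (step 4 of the thread).
    ceph-osd -i 12 --osd-data=/var/lib/ceph/osd/ceph-12 --mkfs

    # Start the OSD (sysvinit-style shown); peering and backfilling then
    # proceed automatically (step 5).
    service ceph start osd.12

    # Watch recovery until it completes, then repeat for the next OSD (step 6).
    ceph -w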