On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:

> On 13.12.2018 18:19, Alex Gorbachev wrote:
> > On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
> > <dietmar.rieder@xxxxxxxxxxx> wrote:
> >> Hi Cephers,
> >>
> >> one of our OSD nodes is experiencing a disk controller problem/failure
> >> (frequent resetting), so the OSDs on this controller are flapping
> >> (up/down in/out).
> >>
> >> I will hopefully get the replacement part soon.
> >>
> >> I have some simple questions: what are the best steps to take now,
> >> before and after replacement of the controller?
> >>
> >> - marking down and shutting down all OSDs on that node?
> >> - waiting until the rebalance is finished
> >> - replace the controller
> >> - just restart the OSDs? Or redeploy them, since they still hold data?
> >>
> >> We are running:
> >>
> >> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> >> (stable)
> >> CentOS 7.5
> >>
> >> Sorry for my naive questions.
> > I usually do ceph osd set noout first to prevent any recoveries
> >
> > Then replace the hardware and make sure all OSDs come back online
> >
> > Then ceph osd unset noout
> >
> > Best regards,
> > Alex
> >
> Setting noout prevents the OSDs from re-balancing, i.e. when you do a
> short fix and do not want it to start re-balancing, since you know the
> data will be available shortly, e.g. a reboot or similar.
>
> If OSDs are flapping you normally want them out of the cluster, so they
> do not impact performance any more.
>

I think in this case the question is, how soon is the new controller
going to be there?
If it's soon and/or if rebalancing would severely impact the cluster
performance, I'd set noout and then shut the node down, stopping both
the flapping and preventing data movement.

Of course if it's a long time until repairs and/or a small cluster (is
there even enough space to rebalance a node's worth of data?) things may
be different.
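For reference, a minimal sketch of the "set noout, shut the OSDs down, repair, unset noout" sequence discussed above. It assumes the OSDs are managed by systemd through the stock ceph-osd.target unit (the usual setup on CentOS 7 with Luminous, but not confirmed in this thread). The helper names run, before_repair and after_repair are made up for illustration, and the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: noout + OSD shutdown around a controller swap.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

before_repair() {
    # Keep down OSDs from being marked out, so no data movement starts.
    run ceph osd set noout
    # Stop every OSD on this node to end the flapping.
    # Assumption: OSDs are systemd units grouped under ceph-osd.target.
    run systemctl stop ceph-osd.target
}

after_repair() {
    # Bring the OSDs on this node back.
    run systemctl start ceph-osd.target
    # Check the up/in counts by eye before lifting the flag.
    run ceph osd stat
    # Allow normal down -> out handling again.
    run ceph osd unset noout
}

# Run on the affected node: "before" prior to the swap, "after" once done.
case "${1:-}" in
    before) before_repair ;;
    after)  after_repair ;;
esac
```

Saved as, say, controller-swap.sh (a hypothetical name), "sh controller-swap.sh before" prints the two commands to run ahead of the swap; setting DRY_RUN=0 executes them instead. While noout is set the cluster runs degraded, so this only suits the "replacement arrives soon" case.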
I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
of course) since I reckon a down node can often be brought back way
faster than a full rebalance.

Regards,

Christian

> kind regards
>
> Ronny Aasen
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications