On 12/14/18 1:44 AM, Christian Balzer wrote:
> On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
>
>> On 13.12.2018 18:19, Alex Gorbachev wrote:
>>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
>>>> Hi Cephers,
>>>>
>>>> one of our OSD nodes is experiencing a Disk controller problem/failure
>>>> (frequent resetting), so the OSDs on this controller are flapping
>>>> (up/down in/out).
>>>>
>>>> I will hopefully get the replacement part soon.
>>>>
>>>> I have some simple questions, what are the best steps to take now before
>>>> and after replacement of the controller?
>>>>
>>>> - marking down and shutting down all osds on that node?
>>>> - waiting for rebalance is finished
>>>> - replace the controller
>>>> - just restart the osds? Or redeploy them, since they still hold data?
>>>>
>>>> We are running:
>>>>
>>>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
>>>> (stable)
>>>> CentOS 7.5
>>>>
>>>> Sorry for my naive questions.
>>> I usually do ceph osd set noout first to prevent any recoveries
>>>
>>> Then replace the hardware and make sure all OSDs come back online
>>>
>>> Then ceph osd unset noout
>>>
>>> Best regards,
>>> Alex
>>
>>
>> Setting noout prevents the osd's from re-balancing. ie when you do a
>> short fix and do not want it to start re-balancing, since you know the
>> data will be available shortly.. eg a reboot or similar.
>>
>> if osd's are flapping you normally want them out of the cluster, so they
>> do not impact performance any more.
>>
> I think in this case the question is, how soon is the new controller going
> to be there?
> If it's soon and/or if rebalancing would severely impact the cluster
> performance, I'd set noout and then shut the node down, stopping both the
> flapping and preventing data movement.
> Of course if it's a long time to repairs and/or a small cluster (is there
> even enough space to rebalance a node worth of data?) things may be
> different.
>
> I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
> of course) since I reckon a down node can often be brought back way faster
> than a full rebalance.

Thanks, Christian, for this comment and suggestion.

I think setting noout and shutting down the node is a good option, because
rebalancing would mean that ~22TB of data has to be moved.
However, the spare part seems to be delayed, so I'm afraid I won't get
it before Monday.

Best
Dietmar

>
> Regards,
>
> Christian
>>
>> kind regards
>>
>> Ronny Aasen
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
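
For reference, a minimal sketch of the sequence discussed in this thread: set noout, stop the OSDs on the affected node, replace the controller, bring the OSDs back and check them, then unset noout. It assumes the OSDs on the node are managed by systemd; the ceph-osd.target unit and the helper function names are assumptions for illustration and are not stated anywhere in the thread. By default the script only prints the commands; nothing is executed unless EXECUTE=1 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set EXECUTE=1 to really run them.
# Assumption: OSDs on this node are systemd-managed via ceph-osd.target.

run() {
    # Print the command; execute it only when EXECUTE=1.
    echo "$*"
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    fi
}

before_replacement() {
    run ceph osd set noout               # down OSDs will not be marked out
    run systemctl stop ceph-osd.target   # stop all OSDs on this node (ends the flapping)
    run ceph -s                          # check cluster state before powering off
}

after_replacement() {
    run systemctl start ceph-osd.target  # bring the node's OSDs back
    run ceph osd tree                    # verify every OSD on the node is up again
    run ceph osd unset noout             # restore normal down/out handling
}
```

Source the file on the affected node, call before_replacement ahead of the hardware swap and after_replacement once the node is back up. Only unset noout after confirming that all OSDs have rejoined, as suggested earlier in the thread.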