Re: disk controller failure

On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:

> On 13.12.2018 18:19, Alex Gorbachev wrote:
> > On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
> > <dietmar.rieder@xxxxxxxxxxx> wrote:  
> >> Hi Cephers,
> >>
> >> one of our OSD nodes is experiencing a Disk controller problem/failure
> >> (frequent resetting), so the OSDs on this controller are flapping
> >> (up/down in/out).
> >>
> >> I will hopefully get the replacement part soon.
> >>
> >> I have some simple questions: what are the best steps to take now, before
> >> and after the replacement of the controller?
> >>
> >> - marking down and shutting down all OSDs on that node?
> >> - waiting until the rebalance is finished
> >> - replacing the controller
> >> - just restarting the OSDs? Or redeploying them, since they still hold data?
> >>
> >> We are running:
> >>
> >> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> >> (stable)
> >> CentOS 7.5
> >>
> >> Sorry for my naive questions.  
> > I usually do ceph osd set noout first to prevent any recoveries
> >
> > Then replace the hardware and make sure all OSDs come back online
> >
> > Then ceph osd unset noout
> >
> > Best regards,
> > Alex  
> 
> 
> Setting noout prevents the OSDs from being marked out and re-balancing from
> starting, i.e. it is for a short fix where you know the data will be
> available again shortly, e.g. after a reboot or similar.
> 
> If OSDs are flapping you normally want them out of the cluster, so they
> do not impact performance any more.
> 
I think in this case the question is: how soon is the new controller going
to be there?
If it's soon, and/or if rebalancing would severely impact cluster
performance, I'd set noout and then shut the node down, which both stops the
flapping and prevents data movement.
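
Roughly, as a sketch (assuming systemd-managed OSDs on that node; the unit
names are the usual defaults, verify them against your deployment):

  ceph osd set noout               # stop the cluster from marking OSDs out
  # on the affected node:
  systemctl stop ceph-osd.target   # stops all ceph-osd@<id> units on the host
  # ... replace the controller, boot the node, let the OSDs start and rejoin ...
  ceph -s                          # wait until all PGs are active+clean again
  ceph osd unset noout
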
Of course if it's a long time until the repair, and/or a small cluster (is
there even enough space to rebalance a node's worth of data?), things may be
different.
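
If you do let it rebalance, a quick capacity sanity check beforehand doesn't
hurt, e.g.:

  ceph osd df tree   # per-OSD utilisation, grouped by host
  ceph df            # overall raw and per-pool usage
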

I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
of course) since I reckon a down node can often be brought back way faster
than a full rebalance.
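
For reference, that's a [mon] option in ceph.conf; a sketch (injectargs may or
may not take effect without a mon restart):

  [mon]
  mon osd down out subtree limit = host

  # runtime, non-persistent:
  ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=host'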

Regards,

Christian
> 
> kind regards
> 
> Ronny Aasen
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications