Re: disk controller failure

On 12/14/18 1:44 AM, Christian Balzer wrote:
> On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
> 
>> On 13.12.2018 18:19, Alex Gorbachev wrote:
>>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
>>> <dietmar.rieder@xxxxxxxxxxx> wrote:  
>>>> Hi Cephers,
>>>>
>>>> one of our OSD nodes is experiencing a Disk controller problem/failure
>>>> (frequent resetting), so the OSDs on this controller are flapping
>>>> (up/down in/out).
>>>>
>>>> I will hopefully get the replacement part soon.
>>>>
>>>> I have a few simple questions: what are the best steps to take now,
>>>> before and after replacement of the controller?
>>>>
>>>> - mark down and shut down all OSDs on that node?
>>>> - wait until rebalancing is finished
>>>> - replace the controller
>>>> - just restart the OSDs? Or redeploy them, since they still hold data?
>>>>
>>>> We are running:
>>>>
>>>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
>>>> (stable)
>>>> CentOS 7.5
>>>>
>>>> Sorry for my naive questions.  
>>> I usually do ceph osd set noout first to prevent any recoveries
>>>
>>> Then replace the hardware and make sure all OSDs come back online
>>>
>>> Then ceph osd unset noout
>>>
>>> Best regards,
>>> Alex  
>>
>>
>> Setting noout prevents the OSDs from being marked out and triggering
>> rebalancing, i.e. for a short fix when you do not want rebalancing to
>> start because you know the data will be available again shortly, e.g.
>> after a reboot or similar.
>>
>> If OSDs are flapping, you normally want them out of the cluster so they
>> do not impact performance any more.
>>
> I think in this case the question is how soon the new controller is going
> to arrive.
> If it's soon, and/or if rebalancing would severely impact cluster
> performance, I'd set noout and then shut the node down, stopping both the
> flapping and any data movement.
> Of course, if it's a long time until repairs and/or a small cluster (is
> there even enough space to rebalance a node's worth of data?) things may
> be different.
> 
> I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
> of course) since I reckon a down node can often be brought back way faster
> than a full rebalance.


Thanks Christian for this comment and suggestion.

I think setting noout and shutting down the node is a good option, because
rebalancing would mean that ~22 TB of data has to be moved.
However, the spare part seems to be delayed, so I'm afraid I'll not get
it before Monday.
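
For the archives, the sequence I'm planning would look roughly like this
(a sketch only; the systemd unit name assumes a standard systemd-based
deployment, adjust to your setup):

```shell
# Before taking the node down: prevent OSDs from being marked "out",
# so no rebalancing starts while the node is offline.
ceph osd set noout

# On the affected node: stop all OSD daemons cleanly, then power off
# for the controller replacement.
systemctl stop ceph-osd.target
shutdown -h now

# --- after the controller has been replaced and the node is booted ---

# Verify the OSDs came back up (the existing OSDs keep their data,
# no redeployment needed).
ceph osd stat
ceph -s

# Once everything is up, re-enable normal out/rebalance behaviour.
ceph osd unset noout
```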

Best
  Dietmar

> 
> Regards,
> 
> Christian
>>
>> kind regards
>>
>> Ronny Aasen
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rieder@xxxxxxxxxxx
Web:   http://www.icbi.at



