Re: disk controller failure

Ronny Aasen <ronny+ceph-users@xxxxxxxx> · Thu, 13 Dec 2018 19:44:30 +0100

On 13.12.2018 18:19, Alex Gorbachev wrote:
On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
Hi Cephers,

one of our OSD nodes is experiencing a Disk controller problem/failure
(frequent resetting), so the OSDs on this controller are flapping
(up/down in/out).

I will hopefully get the replacement part soon.

I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?

- marking down and shutting down all osds on that node?
- waiting until the rebalance has finished
- replace the controller
- just restart the osds? Or redeploy them, since they still hold data?

We are running:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
(stable)
CentOS 7.5

Sorry for my naive questions.

I usually do ceph osd set noout first to prevent any recoveries

Then replace the hardware and make sure all OSDs come back online

Then ceph osd unset noout

Best regards,
Alex

Setting noout prevents down OSDs from being marked out, so the cluster
does not start re-balancing. It is meant for a short fix where you know
the data will be available again shortly, e.g. a reboot or similar.
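
For that short-maintenance case, the noout sequence Alex describes above
could be sketched roughly as follows. This is only an illustration, not a
tested procedure: the run helper and the DRY_RUN variable are made up for
this example so that the commands are printed by default and nothing is
executed by accident.

```shell
# Sketch only. DRY_RUN and run() are hypothetical helpers for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

begin_maintenance() {
    # Down OSDs will not be marked out, so no re-balancing starts.
    run ceph osd set noout
}

end_maintenance() {
    # Check that all OSDs are up and in again before clearing the flag.
    run ceph osd stat
    run ceph osd unset noout
}
```

Call begin_maintenance before the hardware work and end_maintenance once
the OSDs are back. Set DRY_RUN=0 only after reviewing the printed commands.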

If OSDs are flapping, you normally want them out of the cluster so that
they no longer impact performance.
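
A minimal sketch of taking flapping OSDs out, under some assumptions: the
OSD ids 3 and 7 used below are placeholders, the daemons are assumed to be
managed by systemd (unit ceph-osd@<id>), and stopping the daemon is an
extra step I added so the OSD cannot flap back up. As before, run and
DRY_RUN are made-up helpers that only print the commands by default.

```shell
# Sketch only. DRY_RUN and run() are hypothetical helpers for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

retire_osds() {
    # Arguments: numeric OSD ids that sit behind the failing controller.
    for id in "$@"; do
        # Mark the OSD out so its data is re-replicated elsewhere.
        run ceph osd out "$id"
        # Assumption: systemd-managed OSDs. This must run on the OSD node.
        run systemctl stop "ceph-osd@${id}"
    done
}
```

For example, retire_osds 3 7 prints the out and stop commands for the two
placeholder ids. Watch ceph -s afterwards to follow the recovery.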

kind regards

Ronny Aasen

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com