Nautilus: Taking out OSDs that are 'Pending Failure'

Dave Hall <kdhall@xxxxxxxxxxxxxx> · Fri, 4 Aug 2023 09:44:57 -0400

Hello.  It's been a while.  I have a Nautilus cluster with 72 x 12GB HDD
OSDs (BlueStore) and mostly EC 8+2 pools/PGs.  It's been working great -
some nodes went nearly 900 days without a reboot.

As of yesterday I found that I have 3 OSDs with a SMART status of 'Pending
Failure'.  New drives are ordered and will be here next week.  There is a
procedure in the documentation for replacing an OSD, but I can't follow it
until I receive the drives.

My inclination is to mark these 3 OSDs 'OUT' before they crash completely,
but I want to confirm my understanding of Ceph's response to this.  Mainly,
given my EC pools (or replicated pools for that matter), if I mark all 3
OSDs out at once, do I risk data loss?

If I have it right, marking an OSD out will simply cause Ceph to move all
of the PG shards from that OSD to other OSDs, so no major risk of data
loss.  However, if it would be better to do them one per day or something,
I'd rather be safe.
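
To make the worry concrete, here is a minimal sketch of the shard
arithmetic for an EC k+m pool.  It only models how many shards of one PG
are missing outright (as when OSDs die); my understanding is that an OSD
that is 'out' but still up keeps serving its shards during backfill, so
marking out should not count as a loss.  The function name is made up for
illustration, and the min_size default of k+1 is an assumption to check
against the actual pool settings.

```python
from typing import Optional


def pg_state_after_losing(k: int, m: int, shards_lost: int,
                          min_size: Optional[int] = None) -> str:
    """Classify one EC placement group after some of its shards are gone.

    k, m        -- data and coding shard counts of the EC profile
    shards_lost -- shards of this PG that are no longer readable
    min_size    -- pool min_size; ASSUMED to be k + 1 when not given
    """
    if min_size is None:
        min_size = k + 1  # assumption, verify against the pool's settings
    remaining = k + m - shards_lost
    if remaining < k:
        # Fewer than k shards left: the data cannot be reconstructed.
        return "unrecoverable"
    if remaining < min_size:
        # Still reconstructable, but the PG would stop serving I/O.
        return "recoverable but inactive"
    return "active"


if __name__ == "__main__":
    # EC 8+2: walk through losing 0..3 shards of the same PG.
    for lost in range(4):
        print(lost, pg_state_after_losing(8, 2, lost))
```

So for 8+2 the case to avoid is three shards of the same PG disappearing
before recovery finishes, which is why I would rather take the drives out
while they still respond.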

I also assume that I should wait for the rebalance to complete before I
initiate the replacement procedure.
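
For reference, the sequence I have in mind looks roughly like the sketch
below: mark an OSD out, then poll 'ceph osd safe-to-destroy' until it
reports success before touching the drive.  The OSD ID, the polling
interval, and the helper names are placeholders I made up; only the two
ceph subcommands are taken from the CLI.

```python
import subprocess
import time


def out_command(osd_id: int) -> list:
    """argv for marking one OSD out (it stays up and keeps serving data)."""
    return ["ceph", "osd", "out", str(osd_id)]


def safe_to_destroy_command(osd_id: int) -> list:
    """argv for asking whether the OSD can be removed without reducing durability."""
    return ["ceph", "osd", "safe-to-destroy", "osd.{}".format(osd_id)]


def wait_until_safe(osd_id: int, poll_seconds: float = 60,
                    run=subprocess.run) -> int:
    """Poll safe-to-destroy until it exits 0; return the number of polls.

    `run` is injectable so the loop can be exercised without a cluster.
    """
    polls = 1
    while run(safe_to_destroy_command(osd_id)).returncode != 0:
        time.sleep(poll_seconds)
        polls += 1
    return polls


if __name__ == "__main__":
    osd_id = 12  # hypothetical OSD ID, not one of my real ones
    subprocess.run(out_command(osd_id), check=True)
    wait_until_safe(osd_id)
    print("osd.{} reports safe to destroy".format(osd_id))
```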

Your thoughts?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx