Hi all, a hopefully simple question this time. I would like a second opinion on a procedure for replacing a large number of disks.

We need to replace about 40 disks distributed over all 12 hosts backing a large pool with EC 8+3. We can't do this host by host, because replacing the disks of one host and letting recovery rebuild the data before moving on would take far too long. We would therefore like to evacuate all of these disks simultaneously and with as little data movement as possible. This is the procedure that seems to do the trick (a rough command sketch follows below my signature):

1.) For all OSDs: ceph osd reweight ID 0   # note: not "ceph osd crush reweight"
2.) Wait for the rebalance to finish.
3.) Replace the disks and deploy OSDs with the same IDs as before, per host.
4.) Start the OSDs and let the data rebalance back.

I tested step 1 on Octopus with one disk and it seems to work. The reason I ask is that step 1 actually marks the OSDs as OUT. However, they are still UP and I see only misplaced objects, not degraded objects. It is a bit counter-intuitive, but it seems that UP+OUT OSDs still participate in IO. Because it is counter-intuitive, I would like to have a second opinion.

I have read before that others reweight to something like 0.001 and hope that this flushes all PGs. I would prefer not to rely on hope, and a reweight to 0 apparently is a valid choice here, even though it leads to the somewhat weird state of UP+OUT OSDs.

A problem that could arise is some timeout I'm overlooking that makes data chunks on UP+OUT OSDs unavailable after a while. I'm also wondering whether UP+OUT OSDs still participate in peering if an OSD somewhere in the pool restarts.

Thanks for your input and best regards!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
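
P.S. For concreteness, here is a rough sketch of the commands I have in mind. The OSD IDs and the device path are placeholders, and step 3 assumes a plain ceph-volume based deployment; adjust for whatever tooling actually manages the OSDs.

    # Step 1: drain the disks to be replaced (placeholder OSD IDs).
    OSDS="0 7 13 21"
    for id in $OSDS; do
        ceph osd reweight $id 0            # note: not "ceph osd crush reweight"
    done

    # Step 2: wait until no PGs are remapped/backfilling any more, then
    # double-check that the drained OSDs really hold no data.
    while ceph pg stat | grep -Eq 'remapped|backfill'; do sleep 60; done
    ceph osd safe-to-destroy $OSDS         # should report "safe to destroy"

    # Step 3 (per host, per disk): keep the OSD ID, recreate it on the new disk.
    # /dev/sdX is a placeholder for the replacement drive.
    ceph osd destroy $id --yes-i-really-mean-it
    ceph-volume lvm create --osd-id $id --data /dev/sdX

    # Step 4: mark the recreated OSD in again so the data rebalances back.
    ceph osd in $id

The safe-to-destroy check in step 2 is mainly there to convince myself that no data chunks are left on the UP+OUT OSDs before the disks are actually pulled.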